Wikipedia Data Analysis
This is a team assignment.
Wikipedia is an amazing and bizarre community-driven project where
anyone can make edits to almost any article. If you haven't ever
tried to edit a Wikipedia article before, you should do so -- it's fun
and interesting.
Wikipedia is rapidly becoming the default and free encyclopedia of the
world. This is amazingly cool, but also scary considering some of the flaws
present in Wikipedia. There are dramatic inequities amongst the Wikipedia
contributor community, and in the choices that they make. One example in
particular is that only 13% of contributors are women. Is this a problem? Some
might argue that the gender of an author doesn't matter if the content is
fine. What can we say about "female" vs. "male" content in Wikipedia?
For this project, you'll reproduce research I was involved in that looked at
English Wikipedia articles of interest to women and articles of interest to men, and compared
the lengths of those articles. Research has shown that article length
correlates with article quality [1, 2]. While it is not a perfect
predictor, using the length of an article is a good proxy for estimating how
good an article is.
More specifically, for this project, you will attempt to determine if
Wikipedia articles of interest to men and women are of considerably different
lengths.
The data
Determining the (approximate) length of a Wikipedia article is easy. That
will be the last step of the work that you do. The challenging part is to
determine which articles are more interesting to males vs. females in a
systematic and reproducible way. The obvious approach is to get this data
from Wikipedia itself, but that is hard. Most Wikipedia editors do not
supply their gender (this field in a user profile is optional), so there may be a
strong "self-selection" bias: editors who supply gender information may differ
systematically from those who do not.
Instead, we chose to use gender information from MovieLens, which is a free
online movie recommendation site. Over 80% of the users in MovieLens report
their gender (unlike Wikipedia, where only 2.8% of contributors report their
gender). While it is possible that there is bias or inaccuracy in the MovieLens
gender data, it seems much less likely than in Wikipedia.
We used MovieLens data to identify which movies should be of strongest
interest to women, and which movies should be of strongest interest to men. We
then compared the average lengths of those Wikipedia articles to look for a
difference.
Your task
Replicate our research! Your job is to:
- Download the most recent MovieLens dataset with demographic information. The data is a
little on the old side; alas, more recent releases by the project have not
included demographic information on the users. (When we did our research study,
we had the advantage that one of our co-authors was on the MovieLens team, and
had access to more recent data behind the scenes.)
- The data that you download has a README file within it. After you unzip
the data, read the README file to learn about how the data is stored.
- Write a Python program to read the three files (movies.dat, ratings.dat,
and users.dat) into Python. Think very carefully about how to store the
information. For example, if you'll want to look up information in movies.dat
by a movie id, you'll want to put the data into a dictionary keyed on movie
id. Don't just start coding here; read through the rest of the assignment
first, and think about what your algorithm will look like. You'll be looping
through one set of data, and doing lookups on others. What are you looping
over, and what are you looking up? You want to store your data appropriately
to make this fast.
- Produce a list of the 20 most "male" movies and the 20 most "female"
movies. Figuring out how to measure the genderedness of a movie is part of
your task, and is not clear cut. Should you use the average rating by people
from each gender, which measures how a movie was liked by each gender? Or
should you use how often a movie was rated by each gender, regardless of
whether the rating was positive or negative? This would measure what movies
each gender chose to watch, regardless of the opinion they formed. For the top
20 female movies (and ditto for male), should you choose the movies that score
the highest on whichever metric you choose for "femaleness"? Or should you
choose the movies that have the highest difference between the female scores
and the male scores? You'll need to argue the technique you choose. You might
want to try more than one.
- Once you have chosen your two lists of 20 movies, measure the length of
the English Wikipedia article for each. This is hard to automate completely
because the names of the movies in the MovieLens dataset don't precisely match
the names in Wikipedia. You'll have to search Wikipedia manually to find
the name of the Wikipedia article that matches each movie. Once you've done
this, you can use or modify this program I
wrote to measure the lengths of a series of Wikipedia articles.
- Summarize your results in a way that is meaningful. Submit a short paper
(perhaps 3 pages or so, including tables of data or graphs) describing how you
approached what you did, and what you learned. You can use whatever software
you like to create this document, but you should submit it as a PDF. This is
good practice for transmitting electronic work: sending word processor
documents (such as Microsoft Word files) does not guarantee that your reader
will see the layout in the same way that you do.
- You should ultimately submit both your Python program(s) and your paper.
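To make the storage question above concrete, here is a minimal sketch of one way to get the three files into Python structures. It assumes each line is a record with fields separated by `::` (users: id, gender, ...; movies: id, title, ...; ratings: user id, movie id, rating, ...); check the README to confirm the layout before relying on this. The function names and the sample lines are made up for illustration, and this is only one possible design, not the required one.

```python
def parse_users(lines):
    """Map user id -> gender code, e.g. 'M' or 'F' (assumed field order)."""
    genders = {}
    for line in lines:
        fields = line.rstrip("\n").split("::")
        genders[int(fields[0])] = fields[1]
    return genders


def parse_movies(lines):
    """Map movie id -> title (assumed field order)."""
    titles = {}
    for line in lines:
        fields = line.rstrip("\n").split("::")
        titles[int(fields[0])] = fields[1]
    return titles


def parse_ratings(lines):
    """Return a list of (user_id, movie_id, rating) tuples."""
    ratings = []
    for line in lines:
        fields = line.rstrip("\n").split("::")
        ratings.append((int(fields[0]), int(fields[1]), int(fields[2])))
    return ratings


# Made-up sample lines in the assumed layout, not real MovieLens records.
sample_users = ["1::F::25::3::55057\n", "2::M::35::7::55455\n"]
sample_movies = ["10::Example Movie (1999)::Comedy\n"]
sample_ratings = ["1::10::4::978300000\n", "2::10::2::978300100\n"]

user_gender = parse_users(sample_users)    # dictionary: fast lookup by user id
movie_title = parse_movies(sample_movies)  # dictionary: fast lookup by movie id
ratings = parse_ratings(sample_ratings)    # list: this is what you loop over
```

Users and movies go into dictionaries because they are looked up by id, while ratings stay in a list because they are only ever looped over. With the real data you could pass an open file instead of a list, for example `parse_users(open("users.dat", encoding="latin-1"))`; the `latin-1` encoding is an assumption on my part, chosen because it can decode any byte without raising an error.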
Parts 1 and 2
In order to get you started on this assignment, there are actually two
submissions you'll need to make. Part 2 is the final project, as described
above. For Part 1, submit Python code which determines (and prints out) the
number of males and the number of females, separately, that rated the movies
"Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These
numbers in particular will help the graders determine if you are on the right
track.
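The counting for Part 1 follows the "loop over one set of data, look up in another" pattern. Here is a rough sketch on toy data: it loops over the ratings and looks up each rater's gender in a dictionary. The data, the titles, and the function name `count_raters_by_gender` are all hypothetical, and the sketch assumes each user rates a given movie at most once.

```python
from collections import defaultdict

# Hypothetical toy data, standing in for what you would read from the files.
user_gender = {1: "F", 2: "M", 3: "M"}
movie_title = {10: "Example Movie (1999)", 11: "Another Movie (1997)"}
ratings = [(1, 10, 4), (2, 10, 2), (3, 10, 5), (1, 11, 3)]


def count_raters_by_gender(ratings, user_gender):
    """Map movie id -> {'M': count, 'F': count} of ratings by gender."""
    counts = defaultdict(lambda: {"M": 0, "F": 0})
    for user_id, movie_id, _rating in ratings:
        # Loop over ratings; look up the rater's gender by user id.
        counts[movie_id][user_gender[user_id]] += 1
    return counts


counts = count_raters_by_gender(ratings, user_gender)

# Invert the id -> title dictionary so movies can be found by title.
title_to_id = {title: movie_id for movie_id, title in movie_title.items()}
for title in ["Example Movie (1999)", "Another Movie (1997)"]:
    by_gender = counts[title_to_id[title]]
    print(title, "males:", by_gender["M"], "females:", by_gender["F"])
```

Inverting the title dictionary only works cleanly if titles are unique; it is worth checking whether that holds in the real data.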
Closing notes
- There are undoubtedly other ways of solving this problem using tools
other than Python. The point of this assignment, however, is to learn how to use
Python dictionaries and other structures in the context of a hopefully
interesting problem. Don't do this assignment via some other magical tool. Excel
has some really neat tricks that would make the Python program mostly
unnecessary, but they would fail if the dataset had 10 million rows.
- A research paper such as this one, in computer science, typically
describes in detail the data used and the approach taken, and presents an
analysis of the results. It does not include low-level details of the program
itself, like "I looped over the data, and incremented a count of the number of
movies that males watched." If you'd like to see some actual research papers
I've written to give you a rough sense of what they might look like, check out
this paper about mentoring in
Wikipedia, and the actual
gender paper on Wikipedia that we wrote. Both of these are considerably
longer than the paper I'm asking you to write, and are at a considerably
higher level; these were written by computer science faculty and graduate
students. Still, they might be interesting to look at, and at least give you a
rough sense of what the important parts of a paper such as this one might
be.
- Addendum to the above point: I've got mixed feelings about sharing the
actual gender paper we wrote, because I don't want it to squelch your
creativity. Don't use it as a source on how to make specific decisions on how
to measure things. There are many decisions we made that were judgment calls;
use your own judgment on those matters, rather than simply mimicking the
choices that we made.
- In order to get the lists of top 20 movies, you'll need to do some
sorting. In Python, you can sort lists via the sort method. You
may want to use a list of tuples, each containing a movie's score and its title, so that
the list sorts by score and brings the titles along with it.
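Here is a small sketch of the tuple-sorting idea from the last note. The rater counts are invented, and the score used (the fraction of a movie's raters who are female) is just one illustrative possibility, not the measure we used and not necessarily the one you should choose.

```python
# Hypothetical rater counts per movie: title -> (female_raters, male_raters).
rater_counts = {
    "Movie A": (30, 10),
    "Movie B": (5, 45),
    "Movie C": (20, 20),
    "Movie D": (12, 8),
}

# Build (score, title) tuples. Python compares tuples element by element,
# so the list sorts by score first and uses the title only to break ties.
scored = []
for title, (female, male) in rater_counts.items():
    female_share = female / (female + male)
    scored.append((female_share, title))

scored.sort()  # ascending: lowest female share first

most_male = scored[:2]           # the two lowest scores
most_female = scored[-2:][::-1]  # the two highest scores, highest first

print("Most male:", [title for _score, title in most_male])
print("Most female:", [title for _score, title in most_female])
```

Note that a raw share like this ignores how many men and women are in the dataset overall, which is one of the judgment calls you'll need to think through and defend.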
Good luck, and have fun! Remember that lab assistants are available
in the evenings in CMC 306 to help out if you need it.
References
[1] J. E. Blumenstock. Size matters: Word count as a measure of
quality on Wikipedia. In Proc. WWW 2008. ACM.
[2] T. Wöhner and R. Peters. Assessing the quality of Wikipedia
articles with lifecycle based metrics. In Proc. WikiSym 2009,
New York, NY. ACM.