Wikipedia Demographic Analysis

Wikipedia Data Analysis

This is a team assignment.

Wikipedia is an amazing and bizarre community-driven project where anyone can make edits to almost any article. If you haven't ever tried to edit a Wikipedia article before, you should do so -- it's fun and interesting.

Wikipedia is rapidly becoming the default and free encyclopedia of the world. This is amazingly cool, but also scary considering some of the flaws present in Wikipedia. There are dramatic inequities amongst the Wikipedia contributor community, and in the choices that they make. One example in particular is that only 13% of contributors are women. Is this a problem? Some might argue that the gender of an author doesn't matter if the content is fine. What can we say about "female" vs. "male" content in Wikipedia?

For this project, you'll reproduce research I was involved in that looked at English Wikipedia articles of interest to women and articles of interest to men, and compared the lengths of those articles. Research has shown that article length correlates with article quality¹,². While it is not a perfect predictor, using the length of an article is a good proxy for estimating how good an article is.

More specifically, for this project, you will attempt to determine if Wikipedia articles of interest to men and women are of considerably different lengths.

The data

Determining the (approximate) length of a Wikipedia article is easy. That will be the last step of the work that you do. The challenging part is to determine which articles are more interesting to males vs. females in a systematic and reproducible way. The obvious thing to do is to get this data from Wikipedia somehow, except that this is hard. Most Wikipedia editors do not supply their gender (this info in a user profile is optional), so there may be a strong "self-selection" bias amongst those who supply gender info and those who do not.

Instead, we chose to use gender information from MovieLens, which is a free online movie recommendation site. Over 80% of the users in MovieLens report their gender (unlike Wikipedia, where only 2.8% of contributors report their gender). While it is possible that there is bias or inaccuracy in MovieLens regarding its gender data, it seems as though this would be much less likely than in Wikipedia.

We used MovieLens data to identify which movies should be of strongest interest to women, and which movies should be of strongest interest to men. We then compared the average lengths of those Wikipedia articles to look for a difference.

Your task

Replicate our research! Your job is to:

Download the most recent MovieLens dataset with demographic information. The data is a little on the old side; alas, more recent releases by the project have not included demographic information on the users. (When we did our research study, we had the advantage that one of our co-authors was on the MovieLens team, and had access to more recent data behind the scenes.)
The data that you download has a README file within it. After you unzip the data, read the README file to learn about how the data is stored.
Write a Python program to read the three files (movies.dat, ratings.dat, and users.dat) into Python. Think very carefully about how to store the information. For example, if you'll want to look up information in movies.dat by a movie id, you'll want to put the data into a dictionary keyed on movie id. Don't just start coding here; read through the rest of the assignment first, and think about what your algorithm will look like. You'll be looping through one set of data, and doing lookups on others. What are you looping over, and what are you looking up? You want to store your data appropriately to make this fast.
Produce a list of the 20 most "male" movies and the 20 most "female" movies. Figuring out how to measure the genderedness of a movie is part of your task, and is not clear cut. Should you use the average rating by people from each gender, which measures how a movie was liked by each gender? Or should you use how often a movie was rated by each gender, regardless of whether the rating was positive or negative? This would measure what movies each gender chose to watch, regardless of the opinion they formed. For the top 20 female movies (and ditto for male), should you choose the movies that score the highest on whichever metric you choose for "femaleness"? Or should you choose the movies that have the highest difference between the female scores and the male scores? You'll need to argue the technique you choose. You might want to try more than one.
Once you have chosen your two lists of 20 movies, measure the length of the English Wikipedia article for each. This is hard to completely automate because the names of the movies in the MovieLens dataset don't precisely match the names in Wikipedia. You'll have to manually search Wikipedia to find the name of the Wikipedia article that matches each movie. Once you've done this, you can use or modify this program I wrote to measure the lengths of a series of Wikipedia articles.
Summarize your results in a way that is meaningful. Submit a short paper (perhaps 3 pages or so, including tables of data or graphs) describing how you approached what you did, and what you learned. You can use whatever software you like to create this document, but you should submit it as a PDF. This is good practice for transmitting electronic work: sending word processor documents (such as Microsoft Word files) does not guarantee that your reader will see the layout in the same way that you do.
You should ultimately submit both your Python program(s) and your paper.
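The loading step in the list above can be sketched as follows. This is only a minimal sketch, not a required design: it assumes that each line of the data files separates its fields with "::" and that the fields come in the order noted in the comments, so check the README for the actual layout. The function names (parse_users, parse_movies, parse_ratings) are hypothetical, and the sample lines are made up for illustration.

```python
def parse_users(lines):
    """Map user id -> gender. Assumed layout: UserID::Gender::(other fields)."""
    genders = {}
    for line in lines:
        fields = line.strip().split("::")
        genders[fields[0]] = fields[1]
    return genders


def parse_movies(lines):
    """Map movie id -> title. Assumed layout: MovieID::Title::(other fields)."""
    titles = {}
    for line in lines:
        fields = line.strip().split("::")
        titles[fields[0]] = fields[1]
    return titles


def parse_ratings(lines):
    """Return (user id, movie id, rating) tuples.

    Assumed layout: UserID::MovieID::Rating::(other fields).
    """
    ratings = []
    for line in lines:
        fields = line.strip().split("::")
        ratings.append((fields[0], fields[1], int(fields[2])))
    return ratings


# Made-up sample lines standing in for the real files.
sample_users = ["1::F::25::3::00000", "2::M::40::7::11111"]
sample_movies = ["10::Example Movie (1990)::Drama"]
sample_ratings = ["1::10::5::100000000", "2::10::3::100000001"]

user_gender = parse_users(sample_users)
movie_title = parse_movies(sample_movies)
ratings = parse_ratings(sample_ratings)

# Loop over the ratings (the largest file); look up users and movies by id.
for user_id, movie_id, rating in ratings:
    print(movie_title[movie_id], user_gender[user_id], rating)
```

Because each function accepts any iterable of lines, an open file object can be passed in place of the sample lists. If Python raises a decoding error on one of the files, passing an explicit encoding to open may help.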

Parts 1 and 2

In order to get you started on this assignment, there are actually two submissions you'll need to make. Part 2 is the final project, as described above. For Part 1, submit Python code which determines (and prints out) the number of males and the number of females, separately, that rated the movies "Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These numbers in particular will help the graders determine if you are on the right track.
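One way to produce counts like these is to make a single pass over the ratings and tally by movie and gender. The following is a minimal sketch on made-up data, not the expected solution. It assumes the genders and titles have already been loaded into dictionaries keyed on id, and that each user rates a given movie at most once, so that counting ratings is the same as counting people. All names and values here are hypothetical.

```python
from collections import defaultdict

# Made-up data for illustration; the real values come from the data files.
user_gender = {"1": "F", "2": "M", "3": "M"}
movie_title = {"10": "Example Movie A (1990)", "11": "Example Movie B (1991)"}
ratings = [("1", "10"), ("2", "10"), ("3", "10"), ("1", "11")]  # (user id, movie id)

# counts[movie id][gender] -> number of ratings from users of that gender
counts = defaultdict(lambda: {"M": 0, "F": 0})
for user_id, movie_id in ratings:
    counts[movie_id][user_gender[user_id]] += 1

for movie_id, by_gender in counts.items():
    print(movie_title[movie_id], "males:", by_gender["M"], "females:", by_gender["F"])
```

The dictionary lookups keep the loop fast: each rating costs one lookup for the user's gender and one for the movie's tally, no matter how large the files are.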

Closing notes

There are undoubtedly other ways of solving this problem using tools other than Python. The point of this assignment, however, is to learn how to use Python dictionaries and other structures in the context of a hopefully interesting problem. Don't do this assignment via some other magical tool. Excel has some really neat tricks that would make the Python program mostly unnecessary, but they would fail if the dataset had 10 million rows.
A research paper such as this one, in computer science, is typically written so as to describe in detail the data used and the approach taken, and to give an analysis of the results. It does not include low-level details of the program itself, like "I looped over the data, and incremented a count of the number of movies that males watched." If you'd like to see some actual research papers I've written to give you a rough sense of what they might look like, check out this paper about mentoring in Wikipedia, and the actual gender paper on Wikipedia that we wrote. Both of these are considerably longer than the paper I'm asking you to write, and are at a considerably higher level; these were written by computer science faculty and graduate students. Still, they might be interesting to look at, and at least give you a rough sense of what the important parts of a paper such as this one might be.
Addendum to the above point: I've got mixed feelings about sharing the actual gender paper we wrote, because I don't want it to squelch your creativity. Don't use it as a source on how to make specific decisions on how to measure things. There are many decisions we made that were judgment calls; use your own judgment on those matters, rather than simply mimicking the choices that we made.
In order to get the lists of top 20 movies, you'll need to do some sorting. In Python, you can sort lists via the sort method. You may want to use a list of tuples containing movie value and title, so that it sorts by value and brings the titles along with it.
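The sorting idea in the last note can be sketched like this. The scores and titles are made up, and the score stands in for whatever genderedness metric you end up choosing; the sketch takes the top 2 where the assignment asks for the top 20.

```python
# Made-up (score, title) pairs; the score is whatever metric you chose.
scored = [
    (0.12, "Example Movie A"),
    (0.47, "Example Movie B"),
    (-0.30, "Example Movie C"),
]

# Tuples compare by their first element, then by the second on ties,
# so this sorts by score and brings each title along with it.
scored.sort(reverse=True)

top_two = scored[:2]  # slice off the highest-scoring entries
for score, title in top_two:
    print(title, score)
```

Putting the score first in each tuple is what makes the plain sort method work here; with the title first, the list would come out in alphabetical order instead.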

Good luck, and have fun! Remember that lab assistants are available in the evenings in CMC 306 to help out if you need it.

References

¹J. E. Blumenstock. Size matters: Word count as a measure of quality on Wikipedia. In Proc. WWW 2008. ACM.

²T. Wöhner and R. Peters. Assessing the quality of Wikipedia articles with lifecycle based metrics. In Proc. WikiSym 2009, New York, NY. ACM.