Wikipedia Data Analysis

1. Overview
2. The data
3. Your task
4. Parts 1 and 2
5. Additional details
6. Design and style
7. The last point
8. Submit your work
9. Important note regarding QRE

This is a pair programming assignment. If you are working in a pair, this means that you and your partner should be doing the entirety of this assignment side-by-side, on a single computer, where one person is "driving" and the other is "navigating." Set a timer to swap every 15 minutes. You can choose your favorite from this online timer page. Make sure your sound volume is audible, but not so loud as to disturb the people around you.

If you are working in a pair, only one of you needs to submit your work via Moodle. That said, make sure that both of you end up with a copy of your work in case you want it someday; you can email it or use some other mechanism to transfer it.

We will use anonymous grading on Moodle, which means that the grader won't see your name until after the grading is done. This is an easy way to add an extra element of fairness to the grading. Therefore, make sure your name doesn't appear on your actual submission. When you submit via Moodle, it will know who you are. Thanks!

1 Overview

Wikipedia is an amazing and bizarre community-driven project where anyone can make edits to almost any article. If you haven't ever tried to edit a Wikipedia article before, you should do so – it's fun and interesting.

Wikipedia seems to have become the default and free encyclopedia of the world. This is amazingly cool, but also scary considering some of the flaws present in Wikipedia. There are dramatic inequities amongst the Wikipedia contributor community, and in the choices that they make. One example in particular is that only 9-13% of contributors to Wikipedia identify as women. Is this a problem? Some might argue that the gender of an author doesn't matter if the content is well-balanced. Is it?

For this project, you'll reproduce research I was involved in that looked at English Wikipedia articles of interest to contributors who identified as women and contributors who identified as men, and compared the lengths of those articles. (The data we had access to, unfortunately, did not offer non-binary gender options for people to select. Our work is admittedly incomplete in that respect, but the results are still worthwhile for what we were able to study.) Why is Wikipedia article length important? Other research has shown that article length correlates with article quality¹,². While it is not a perfect predictor, the length of an article is a good proxy for estimating how good an article is.

More specifically, for this project, you will attempt to determine if Wikipedia articles of interest to people who identify as male and female are of considerably different lengths.

2 The data

Determining the (approximate) length of a Wikipedia article is easy. That will be the last step of the work that you do. The challenging part is to determine, in a systematic and reproducible way, which articles are more interesting to people who identify with a particular gender. The obvious thing to do is to get this data from Wikipedia somehow, except that this is hard. Most Wikipedia editors did not historically supply their gender (this info in a user profile is optional), so there may be a strong "self-selection" bias between those who supply gender info and those who do not.

Instead, we chose to use gender information from MovieLens, which is a free online movie recommendation site. Over 80% of the users in MovieLens report their gender (unlike Wikipedia, where only a small fraction of contributors report their gender). While it is possible that there is bias or inaccuracy in MovieLens regarding its gender data, we believe that this is much less likely than in Wikipedia.

We used MovieLens data to identify which movies should be of strongest interest to users who identify as women, and which movies should be of strongest interest to users who identify as men. We then compared the average lengths of those Wikipedia articles to look for a difference.

3 Your task

Replicate our research! Your job is to:

3.1 Obtain the data

Download the most recent MovieLens dataset with demographic information. The data is sadly somewhat dated; more recent releases by the project have not included demographic information on the users. (When we did our research study, we had the advantage that one of our co-authors was on the MovieLens team, and had access to more recent data behind the scenes.)

The data that you download has a README file within it. After you unzip the data, read the README file to learn about how the data is stored.

3.2 Read the data into Python

Write a Python program to read the three files (movies.dat, ratings.dat, and users.dat) into Python. Think very carefully about how to store the information. For example, if you'll want to look up information in movies.dat by a movie id, you'll want to put the data into a dictionary keyed on movie id. Don't just start coding here; read through the rest of the assignment first, and think about what your algorithm will look like. You'll be looping through one set of data, and doing lookups on others. What are you looping over, and what are you looking up? You want to store your data appropriately to make this fast.
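To make the dictionary idea concrete, here is a minimal sketch for one of the three files. The function name parse_users is hypothetical, and the field layout it assumes for users.dat (UserID::Gender::Age::Occupation::Zip-code) should be checked against the README before you rely on it. The sample lines are made up for illustration.

```python
def parse_users(lines):
    """Build {user_id: gender} from an iterable of users.dat-style lines."""
    gender_by_user = {}
    for line in lines:
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        fields = line.split('::')
        # Assumed layout: the user id comes first, the gender second.
        user_id = int(fields[0])
        gender_by_user[user_id] = fields[1]
    return gender_by_user


# Made-up sample lines in the assumed format, standing in for the real file.
sample_lines = ['1::F::1::10::48067\n', '2::M::56::16::70072\n']
users = parse_users(sample_lines)
print(users[2])
```

Because the function takes any iterable of lines, you could pass it an open file object just as easily as a list. Keying on the user id means that finding a user's gender later is a single dictionary lookup rather than a search through a list.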

One detail: the movies.dat file has some international characters in it. The default technique for opening a file in Python generally assumes that the file is in UTF-8 format, which is the most popular modern encoding for storing Unicode text. This file, however, is stored in the latin_1 format, which is a different way of encoding text as bytes. Here is some sample code to show you how to open up the movies file and get everything out of it without error:

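# Open movies.dat as latin_1; the with block closes the file automatically.
with open('movies.dat', mode='r', encoding='latin_1') as movieFile:
    for row in movieFile:
        row = row.rstrip()        # drop the trailing newline
        items = row.split('::')   # fields are separated by '::'
        print(items)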

3.3 Classify, rank, and assess the movies

Produce a list of the 20 most "male identified" movies and the 20 most "female identified" movies. Figuring out how to measure the genderedness of a movie is part of your task, and is not clear cut. Should you use the average rating by people from each gender, which measures how much a movie was liked by each gender? Or should you use how often a movie was rated by each gender, regardless of whether the rating was positive or negative? This would measure what movies each gender chose to watch, regardless of the opinion they formed. For the top 20 female identified movies (and ditto for male), should you choose the movies that score the highest on whichever metric you choose for "femaleness"? Or should you choose the movies that have the highest difference between the female scores and the male scores? You'll need to argue for the technique you choose. You might want to try more than one.
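To illustrate how two of these candidate measures differ, here is a small sketch on made-up toy data. The names female_share and mean_rating_gap are hypothetical, and neither function is meant as the "right" metric; they are just two of the options described above, written out so you can compare them.

```python
from collections import defaultdict

# Made-up toy data: user id -> gender, and (user id, movie id, rating) rows.
gender_by_user = {1: 'F', 2: 'F', 3: 'M', 4: 'M'}
ratings = [(1, 10, 5), (2, 10, 4), (3, 10, 2),
           (1, 20, 3), (3, 20, 4), (4, 20, 5)]

# movie id -> gender -> list of ratings given by users of that gender
ratings_by_movie = defaultdict(lambda: {'F': [], 'M': []})
for user_id, movie_id, rating in ratings:
    ratings_by_movie[movie_id][gender_by_user[user_id]].append(rating)


def female_share(movie_id):
    """Fraction of a movie's ratings that came from users marked 'F'."""
    by_gender = ratings_by_movie[movie_id]
    total = len(by_gender['F']) + len(by_gender['M'])
    return len(by_gender['F']) / total


def mean_rating_gap(movie_id):
    """Mean 'F' rating minus mean 'M' rating (both groups must be non-empty)."""
    by_gender = ratings_by_movie[movie_id]
    mean_f = sum(by_gender['F']) / len(by_gender['F'])
    mean_m = sum(by_gender['M']) / len(by_gender['M'])
    return mean_f - mean_m


for movie_id in sorted(ratings_by_movie):
    print(movie_id, female_share(movie_id), mean_rating_gap(movie_id))
```

Note that a movie with only a handful of ratings can land at the extreme of either measure, so you may want to think about whether to require a minimum number of ratings. That is a judgment call for you to make and defend.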

Once you have chosen your two lists of 20 movies, measure the length of the English Wikipedia article for each. This is hard to completely automate because the names of the movies in the MovieLens dataset don't precisely match the names in Wikipedia. You'll have to manually search Wikipedia to find the names of the Wikipedia articles that correspond to each movie. Once you've done this, you can use or modify this program I wrote to measure the lengths of a series of Wikipedia articles.
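If you are curious what measuring an article's length can look like, here is a separate sketch; it is not the program mentioned above. It assumes the MediaWiki API's action=query request with prop=info, which reports a length field giving the page size in bytes. That may not be the same notion of length the provided program uses. The function names are hypothetical, the sample response is made up, and the sketch only builds the request URL and parses a response; it does not contact Wikipedia.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php'


def build_info_url(title):
    """Return a query URL asking for basic page info about one article."""
    params = {'action': 'query', 'prop': 'info',
              'titles': title, 'format': 'json'}
    return API_ENDPOINT + '?' + urlencode(params)


def lengths_from_response(response_text):
    """Map each page title in a JSON response to its reported length."""
    data = json.loads(response_text)
    pages = data['query']['pages']
    # A page that does not exist has no 'length' key, so use None for it.
    return {page['title']: page.get('length') for page in pages.values()}


# Made-up response in the assumed shape, standing in for a real reply.
sample_response = ('{"query": {"pages": {"123": {"pageid": 123, "ns": 0, '
                   '"title": "Example", "length": 4321}}}}')
print(build_info_url('Wag the Dog'))
print(lengths_from_response(sample_response))
```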

3.4 Write up your results

Summarize your results in a way that is meaningful. Submit a short paper (at least 800 words, plus tables of data or graphs) describing how you approached the problem and what you learned. You can use whatever software you like to create this document, but you should submit it as a PDF. This is good practice for transmitting electronic work: sending word processor documents (such as Microsoft Word files) does not guarantee that your reader will see the layout in the same way that you do. You should ultimately submit both your Python program(s) and your paper.

4 Parts 1 and 2

In order to get you started on this assignment, there are actually two submissions you'll need to make. Part 2 is the final project, as described above. For Part 1, submit Python code which determines (and prints out) the number of people identified as males and the number of people identified as females, separately, that rated the movies "Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These numbers in particular will help the graders determine if you are on the right track.
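The general pattern for Part 1 is to loop over the ratings and look up each rater's gender. Here is a minimal sketch of that pattern on made-up toy data. The function name and the three data structures are hypothetical, the numbers have nothing to do with the real dataset, and the sketch assumes each user rates a given movie at most once.

```python
def count_raters_by_gender(title, movie_id_by_title, ratings, gender_by_user):
    """Count how many 'F' and 'M' users rated the movie with this title."""
    target_id = movie_id_by_title[title]
    counts = {'F': 0, 'M': 0}
    for user_id, movie_id in ratings:
        if movie_id == target_id:
            # Look up the rater's gender and bump the matching counter.
            counts[gender_by_user[user_id]] += 1
    return counts


# Made-up toy data standing in for the three files.
movie_id_by_title = {'Example Movie (2000)': 7}
gender_by_user = {1: 'F', 2: 'M', 3: 'M'}
ratings = [(1, 7), (2, 7), (3, 7), (3, 8)]  # (user id, movie id) pairs

print(count_raters_by_gender('Example Movie (2000)', movie_id_by_title,
                             ratings, gender_by_user))
```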

5 Additional details

There are undoubtedly other ways of solving this problem using tools other than Python. The point of this assignment, however, is to learn how to use Python dictionaries and other structures in the context of a hopefully interesting problem. Don't do this assignment via some other magical tool. Excel has some really neat tricks that would make the Python program mostly unnecessary, but they would fail if the dataset had 10 million rows.

A research paper such as this one, in computer science, is typically written so as to describe in detail the data used and the approach taken, followed by an analysis of the results. It does not include low-level details of the program itself, like "I looped over the data, and incremented a count of the number of movies that males watched." If you'd like to see some actual research papers I've written to give you a rough sense of what they might look like, check out this paper about mentoring in Wikipedia, and the actual gender paper on Wikipedia that we wrote. Both of these are considerably longer than the paper I'm asking you to write, and are at a considerably higher level; these were written by computer science faculty and graduate students. Still, they might be interesting to look at, and at least give you a rough sense of what the important parts of a paper such as this one might be.

Addendum to the above point: I've got mixed feelings about sharing the actual gender paper we wrote, because I don't want it to squelch your creativity. Don't use it as a source on how to make specific decisions on how to measure things. There are many decisions we made that were judgment calls; use your own judgment on those matters, rather than simply mimicking the choices that we made.

In order to get the lists of top 20 movies, you'll need to do some sorting. In Python, you can sort lists via the sort method. You may want to use a list of tuples containing movie value and title, so that it sorts by value and brings the titles along with it.
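The sorting idea can be sketched as follows, using made-up scores and placeholder titles. Tuples are compared element by element, so putting the value first makes the list sort by value, and reverse=True puts the largest values at the front.

```python
# Made-up (value, title) pairs; the values stand in for whatever metric you pick.
scored_movies = [(0.42, 'Movie A'), (0.91, 'Movie B'), (0.17, 'Movie C')]

# Sort in place, highest value first; each title travels with its value.
scored_movies.sort(reverse=True)

# Slice off the top entries (use 20 instead of 2 for the real lists).
top_two = scored_movies[:2]
print(top_two)
```

If two movies have exactly the same value, the tie is broken by comparing the titles, since the title is the second element of each tuple.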

6 Design and style

Make sure that your program follows good design and style. See the guidelines from recent assignments. If you really want to learn lots about Python style, the official style guide for Python is a great read.

7 The last point

If you complete the above tasks (and other things like style, etc. are correct), you will receive nearly all of the points for this assignment. You should feel proud and good about yourself that you have gotten this far, and feel free to stop here! If you want to try to earn the remaining point: look at the users.dat file, and observe that it also shows information on age, occupation, and zip code; also observe that the movies.dat file has information on genres. Use one of these other attributes about users or movies to discover something else within the data, and write about it in your report.

8 Submit your work

When finished, zip up your code and submit your work through Moodle.

Good luck, and have fun! Remember that lab assistants are available in the evenings in CMC 102 and CMC 306 to help out if you need it, and you can attend prefect sessions as well.

9 Important note regarding QRE

CS 111 counts as a Carleton Quantitative Reasoning Encounter. This assignment is the critical aspect of the course for fulfilling that requirement. Therefore, there is a specific additional grading stipulation for this assignment: you must turn in this assignment with passing-quality work, including the short paper listed within, in order to pass the course.

Footnotes:

1. J. E. Blumenstock. Size matters: Word count as a measure of quality on Wikipedia. In Proc. WWW 2008. ACM.

2. T. Wöhner and R. Peters. Assessing the quality of Wikipedia articles with lifecycle based metrics. In Proc. WikiSym 2009, New York, NY. ACM.