Note to self: Next time, give more specific direction on the lab report. It should focus on analytical techniques and results, not on how the code was written. "Lab report" is perhaps a bad name, since it invites too much detail about which functions they chose to write, etc. Consider sharing some papers that I've written.
This is a team assignment.
Wikipedia is an amazing and bizarre community-driven project where anyone can make edits to almost any article. If you haven't ever tried to edit a Wikipedia article before, you should do so -- it's fun and interesting.
Wikipedia is rapidly becoming the default free encyclopedia of the world. This is amazingly cool, but also scary considering some of the flaws present in Wikipedia. There are dramatic inequities within the Wikipedia contributor community, and in the choices that its members make. One example in particular is that only 13% of contributors are women. Is this a problem? Some might argue that the gender of an author doesn't matter if the content is fine. What can we say about "female" vs. "male" content in Wikipedia?
For this project, you'll reproduce research I was involved in that looked at English Wikipedia articles of interest to women and articles of interest to men, and compared the lengths of those articles. Research has shown that article length correlates with article quality [1, 2]. While it is not a perfect predictor, using the length of an article is a good proxy for estimating how good an article is.
More specifically, for this project, you will attempt to determine if Wikipedia articles of interest to men and women are of considerably different lengths.
Determining the (approximate) length of a Wikipedia article is easy. That will be the last step of the work that you do. The challenging part is to determine, in a systematic and reproducible way, which articles are more interesting to men and which are more interesting to women. The obvious thing to do is to get this data from Wikipedia somehow, except that this is hard. Most Wikipedia editors do not supply their gender (this field in a user profile is optional), so there may be a strong "self-selection" bias between those who supply gender info and those who do not.
Instead, we chose to use gender information from MovieLens, which is a free online movie recommendation site. Over 80% of the users in MovieLens report their gender (unlike Wikipedia, where only 2.8% of contributors report their gender). While it is possible that there is bias or inaccuracy in the MovieLens gender data, it seems as though this would be much less likely than in Wikipedia.
We used MovieLens data to identify which movies should be of strongest interest to women, and which movies should be of strongest interest to men. We then compared the average lengths of those Wikipedia articles to look for a difference.
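One way to read the two steps above is sketched below in Python. The text does not say how "strongest interest" was measured, so ranking movies by the female share of their raters, the `min_raters` cutoff, and the `top_n` cutoff are all assumptions made for illustration, as are the function names and the input shapes. This is a sketch of the idea, not the method used in the original research.

```python
from statistics import mean


def female_share(counts, min_raters=1):
    """Map each title to the fraction of its raters who are female.

    Assumed input shape: {title: {"F": n_female, "M": n_male}}.
    Titles with fewer than min_raters total raters are dropped.
    """
    shares = {}
    for title, by_gender in counts.items():
        females = by_gender.get("F", 0)
        total = females + by_gender.get("M", 0)
        if total > 0 and total >= min_raters:
            shares[title] = females / total
    return shares


def split_by_interest(counts, top_n, min_raters=1):
    """Return (most female-skewed titles, most male-skewed titles).

    Assumption: "interest" is approximated by the female share of raters.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    shares = female_share(counts, min_raters)
    # Highest female share first, lowest (most male-skewed) last.
    ranked = sorted(shares, key=shares.get, reverse=True)
    return ranked[:top_n], ranked[-top_n:]


def mean_length(titles, length_of):
    """Average article length over the titles that have a known length.

    length_of is assumed to map title -> article length; returns None
    when none of the titles has a length.
    """
    lengths = [length_of[t] for t in titles if t in length_of]
    return mean(lengths) if lengths else None
```

With per-movie gender counts and a title-to-length mapping in hand, `split_by_interest` gives the two groups and `mean_length` gives one average per group. Whether a difference between those averages is meaningful is a separate question that a plain comparison of means does not answer.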
Replicate our research! Your job is to:
In order to get you started on this assignment, there are actually two submissions you'll need to make. Part 2 is the final project, as described above. For Part 1, submit Python code which determines (and prints out) the number of males and the number of females, separately, that rated the movies "Free Willy (1993)", "Runaway Bride (1999)", and "Wag the Dog (1997)". These numbers in particular will help the graders determine if you are on the right track.
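As a starting point for Part 1, here is a minimal sketch of the counting step. It assumes the MovieLens 1M file layout (`users.dat`, `movies.dat`, and `ratings.dat`, with fields separated by `::` and Latin-1 encoding). If your copy of the data is laid out differently, the field positions and separator will need to change. The function names and the `data_dir` argument are hypothetical.

```python
import os

TITLES = ["Free Willy (1993)", "Runaway Bride (1999)", "Wag the Dog (1997)"]


def parse_records(lines, sep="::"):
    """Split each non-empty line into a list of fields."""
    return [line.rstrip("\n").split(sep) for line in lines if line.strip()]


def count_raters_by_gender(user_lines, movie_lines, rating_lines, titles):
    """Return {title: {"F": n, "M": n}} for the requested titles.

    Assumed field order (MovieLens 1M style):
      users:   UserID::Gender::...
      movies:  MovieID::Title::...
      ratings: UserID::MovieID::...
    """
    gender_of = {rec[0]: rec[1] for rec in parse_records(user_lines)}
    title_of = {rec[0]: rec[1] for rec in parse_records(movie_lines)
                if rec[1] in titles}
    counts = {title: {"F": 0, "M": 0} for title in titles}
    seen = set()  # count each (user, movie) pair once, even if repeated
    for rec in parse_records(rating_lines):
        user_id, movie_id = rec[0], rec[1]
        if movie_id not in title_of or (user_id, movie_id) in seen:
            continue
        seen.add((user_id, movie_id))
        gender = gender_of.get(user_id)
        if gender in ("F", "M"):  # skip users with no recorded gender
            counts[title_of[movie_id]][gender] += 1
    return counts


def read_lines(path):
    """Read a data file; Latin-1 is an assumption about the encoding."""
    with open(path, encoding="latin-1") as f:
        return f.readlines()


def main(data_dir):
    """Print the male and female rater counts for each title in TITLES."""
    counts = count_raters_by_gender(
        read_lines(os.path.join(data_dir, "users.dat")),
        read_lines(os.path.join(data_dir, "movies.dat")),
        read_lines(os.path.join(data_dir, "ratings.dat")),
        TITLES,
    )
    for title in TITLES:
        print(f"{title}: {counts[title]['M']} males, {counts[title]['F']} females")
```

Calling `main` with the directory that holds the three data files prints one line per movie. Keeping the counting logic separate from the file reading makes it easy to check against a few hand-written records before running it on the full data set.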
Good luck, and have fun! Remember that lab assistants are available in the evenings in CMC 306 to help out if you need it.
1. J. E. Blumenstock. Size matters: Word count as a measure of quality on Wikipedia. In Proc. WWW 2008. ACM.
2. T. Wöhner and R. Peters. Assessing the quality of Wikipedia articles with lifecycle based metrics. In Proc. WikiSym 2009, New York, NY. ACM.