Project Overview

Wikipedia[1] is a well-known and widely used source of information. Wikipedia is massive[2], and its content is largely crowdsourced. As a result, it is impossible[citation needed] to manually evaluate the continuously growing number of articles. Even so, Wikipedia users should be able to check the quality and credibility of the articles they rely on. We therefore propose Wikiscore: an automated evaluation process that gives users a quality score and a confidence score for each article they want to use.


In order to make the problem tractable for our timeframe, we won't be retrieving articles from Wikipedia itself. The main English version of Wikipedia has over 6 million articles[2]. To reduce computing time, we will instead use Simple English Wikipedia, a condensed and simplified version of Wikipedia that contains just over 200,000 articles[3].

--------------------------------------------------

Vision

Wikipedia is a well-known and widely used source of information. Wikipedia is massive, and its content is largely crowdsourced. As a result, it is impossible to manually evaluate the continuously growing number of articles. Even so, Wikipedia users should be able to check the quality and credibility of the articles they rely on. We propose an automated version of the evaluation process that gives users a quality score for each article they want to use. Alongside the final score, we present a breakdown of the article metrics that contributed to it, so that users are not left in the dark about how the score was produced. For the first term of this project, our team will make sure that we can run simple algorithms on Wikipedia articles and then store our results in a database that can be queried later. The second term will be used to explore more complicated methods for evaluating the articles. While other projects have aimed to automate the evaluation of Wikipedia articles, we hope that our deep dive into sentiment analysis and into graph analysis of article and author connectivity will set our project apart. Where other projects focus more heavily on the UI that displays the evaluation to the user, we will focus more on the methods used to generate those results.

Overview

Our project had three phases. In the first phase, we built a local implementation of the Wikipedia-page analysis system. This involved building data structures to store a Wikipedia article (and its metadata), writing and implementing algorithms to classify the different parts of the article, and then combining the reported scores into an overview. Once reports could be generated per article, the same type of report was set up for each revision of the given article.
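As a rough illustration of the kind of structures involved (the field names and the way scores are combined below are simplified placeholders, not our exact implementation), an article record might look like:

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class Revision:
        """One revision of an article, with the metrics computed for it."""
        revision_id: int
        author: str
        timestamp: str
        scores: Dict[str, float] = field(default_factory=dict)  # e.g. {"flesch_kincaid": 8.2}

    @dataclass
    class Article:
        """A Simple Wikipedia article plus its metadata and per-revision reports."""
        page_id: int
        title: str
        text: str
        revisions: List[Revision] = field(default_factory=list)

        def overview_score(self) -> float:
            """Average the latest revision's metric scores into a single overview number."""
            if not self.revisions or not self.revisions[-1].scores:
                return 0.0
            latest = self.revisions[-1].scores
            return sum(latest.values()) / len(latest)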

In the second phase of our project, we tied our local implementation to a database system, PostgreSQL, so that page (and revision) scores could be stored and retrieved instead of recalculated each time.

In the final phase of our project, we implemented more complex algorithms based on author connectivity to rate quality, compared them to the relatively simple method put forward in the WikiLyzer paper, and experimented with potential Natural Language Processing (NLP) based algorithms.

Details

In order to make the problem tractable for our timeframe, we did not retrieve articles from Wikipedia itself. Since Wikipedia has over 6 million articles, we used Simple English Wikipedia, a condensed and simplified version of Wikipedia, to reduce computing time. Unfortunately, the prebuilt Pythonic wrapper that provides easy access to Wikipedia's API doesn’t allow pulling from Simple Wikipedia, so we made MediaWiki API calls directly to pull articles from Simple Wikipedia. Once an article was read in, the following algorithms and equations were used to determine its quality and confidence scores. The textstat library provided methods for computing some of the simpler scoring metrics, such as the Flesch-Kincaid grade level, the Gunning fog index, and others. We also incorporated article statistics into the scoring mechanism.
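A minimal sketch of this pipeline is shown below, assuming the requests and textstat packages; the request parameters and the exact set of metrics are illustrative, not necessarily the ones our final implementation uses.

    import requests
    import textstat

    API_URL = "https://simple.wikipedia.org/w/api.php"

    def fetch_plain_text(title: str) -> str:
        """Pull the plain-text extract of a Simple Wikipedia article via the MediaWiki API."""
        params = {
            "action": "query",
            "prop": "extracts",
            "explaintext": 1,   # strip wiki markup / HTML
            "titles": title,
            "format": "json",
        }
        resp = requests.get(API_URL, params=params, timeout=10)
        resp.raise_for_status()
        pages = resp.json()["query"]["pages"]
        return next(iter(pages.values())).get("extract", "")

    def readability_metrics(text: str) -> dict:
        """Compute a few of the textstat readability scores mentioned above."""
        return {
            "flesch_kincaid_grade": textstat.flesch_kincaid_grade(text),
            "gunning_fog": textstat.gunning_fog(text),
            "flesch_reading_ease": textstat.flesch_reading_ease(text),
        }

    if __name__ == "__main__":
        text = fetch_plain_text("Physics")
        print(readability_metrics(text))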

In addition to the metrics that the paper put forward, we also tested the effectiveness of some NLP techniques in determining the quality of an article. Two high-quality articles in different fields will have unique words specific to their topics, but the overall structure of the articles and their sections will be similar. We plan to use the Python library spaCy to tag articles of varying quality and then find the similarities and differences between them, giving us another potential quality metric. Another NLP-based approach we would like to try is domain informativeness: certain vocabulary relates directly to a specific topic. While this is most likely infeasible for the entirety of Wikipedia, we plan to select a category of articles, build a dictionary specific to that category, and then produce a rating score based on the density of those words and how informative they are. Since Wikipedia articles are intended to be unbiased sources of information, it also makes sense to analyze sentiment within articles: articles with higher levels of subjectivity (e.g. “New York is the best state to eat out in”) should be scored lower -- especially if the hyperbole is uncited.
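A minimal sketch of the two simpler ideas follows, assuming spaCy's en_core_web_md model (for word vectors) and the TextBlob library; the reference article and any score thresholds would still need to be chosen, so this is illustrative rather than our tuned approach.

    import spacy
    from textblob import TextBlob

    # A model with word vectors is needed for meaningful document similarity.
    nlp = spacy.load("en_core_web_md")

    def structural_similarity(article_text: str, reference_text: str) -> float:
        """Crude proxy: vector similarity between an article and a known high-quality reference."""
        return nlp(article_text).similarity(nlp(reference_text))

    def subjectivity(article_text: str) -> float:
        """TextBlob subjectivity in [0, 1]; higher values suggest more opinionated prose."""
        return TextBlob(article_text).sentiment.subjectivity

    # Hyperbole like this example should push the subjectivity score up.
    print(subjectivity("New York is the best state to eat out in."))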

Moreover, Wikipedia articles that link to other (well-written) Wikipedia articles within the same category, topic, or theme are more likely to provide a thorough overview of the subject matter. Articles that are edited by reputable authors who often edit articles in a given category, topic, or theme are also more likely to contain useful information. Analyzing article and author connectivity with graph-theoretic representations of the connections between articles and authors allowed us to score and aggregate these metrics.
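A toy sketch of this graph representation using networkx is shown below; the node names and the use of PageRank as the aggregation rule are illustrative choices, not necessarily the exact scoring we settled on.

    import networkx as nx

    # Bipartite-style graph: author nodes connect to the articles they have edited,
    # and article nodes connect to the articles they link to.
    G = nx.Graph()
    G.add_edge("author:alice", "article:Physics")
    G.add_edge("author:alice", "article:Gravity")
    G.add_edge("author:bob", "article:Gravity")
    G.add_edge("article:Physics", "article:Gravity")  # inter-article wiki link

    # PageRank is one way to turn connectivity into a per-node score;
    # well-connected articles (and prolific authors) rank higher.
    rank = nx.pagerank(G)
    print(rank["article:Gravity"])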

PostgreSQL was used as the relational database. It is able to scale vertically, which keeps querying manageable and allows growth as more users score and save new articles to the database. The database schema was designed in dbdiagram.io and exported to PostgreSQL. Given the time constraint, the program only generates scores for articles as a whole; we were unable to score individual article sections. However, the database is structured to allow individual scoring of sections down the road without any changes to the data structures used to store articles.
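A hedged sketch of what creating such a schema through psycopg2 might look like; the table and column names here are hypothetical stand-ins for the schema designed in dbdiagram.io, and the connection parameters are placeholders.

    import psycopg2

    # Hypothetical schema mirroring the structure described above: articles,
    # their revisions, and per-metric scores. A NULL section means the score
    # applies to the whole article, so section-level scoring can be added later
    # without restructuring.
    SCHEMA = """
    CREATE TABLE IF NOT EXISTS articles (
        page_id   INTEGER PRIMARY KEY,
        title     TEXT NOT NULL
    );

    CREATE TABLE IF NOT EXISTS revisions (
        revision_id INTEGER PRIMARY KEY,
        page_id     INTEGER REFERENCES articles(page_id),
        author      TEXT,
        edited_at   TIMESTAMP
    );

    CREATE TABLE IF NOT EXISTS scores (
        revision_id INTEGER REFERENCES revisions(revision_id),
        section     TEXT,           -- NULL = whole-article score
        metric      TEXT NOT NULL,  -- e.g. 'flesch_kincaid', 'connectivity'
        value       DOUBLE PRECISION
    );
    """

    conn = psycopg2.connect(dbname="wikiscore", user="wikiscore", host="localhost")
    with conn, conn.cursor() as cur:
        cur.execute(SCHEMA)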

Psycopg2 was used to pull data from the database for the Python program to run the algorithms and equations on. Finally, for the website, we used Flask and HTML to format and display each article’s information, with a Python backend that fetches the data, calculates the scores, and renders the results on the webpage.
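A minimal sketch of how these pieces fit together; the route, the article.html template name, and the query are illustrative assumptions built on the hypothetical schema above, not the exact application code.

    import psycopg2
    from flask import Flask, render_template

    app = Flask(__name__)

    def get_cached_scores(title: str) -> dict:
        """Look up previously computed metrics for an article, if any."""
        conn = psycopg2.connect(dbname="wikiscore", user="wikiscore", host="localhost")
        with conn, conn.cursor() as cur:
            cur.execute(
                """
                SELECT s.metric, s.value
                FROM scores s
                JOIN revisions r ON r.revision_id = s.revision_id
                JOIN articles  a ON a.page_id = r.page_id
                WHERE a.title = %s
                """,
                (title,),
            )
            return dict(cur.fetchall())

    @app.route("/article/<title>")
    def article(title):
        scores = get_cached_scores(title)
        # If nothing is cached, the real application fetches the article,
        # computes the metrics, and stores them before rendering.
        return render_template("article.html", title=title, scores=scores)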

Literature Review

“WikiLyzer: Interactive Information Quality Assessment in Wikipedia” by Cecilia di Sciascio et al. is the main inspiration for this project. The researchers designed their quality metric by scraping Wikipedia articles and computing the relevance of different attributes of the articles. The consistency of those results was examined, and the results were then combined into a score for the overall accuracy of the information in the article. They assume that good articles must share certain similarities, and by combining those similarities they were able to effectively rank the quality of information.

“Quality Assessment of Wikipedia Articles Without Feature Engineering” by Quang-Vinh Dang and Claudia-Lavinia Ignat took an approach very different from ours. Rather than identifying features to look for within a Wikipedia article, they created a deep neural network using TensorFlow to analyze a set of Wikipedia articles and learn on its own what makes a good article. They designed their approach to avoid the downsides of the bag-of-words approach, since the network takes into account the ordering and semantics of a word, not just its existence. This is an interesting angle from which to approach the problem; however, implementing a DNN is out of scope for our project, given our timeframe.

“Measuring article quality in Wikipedia: models and evaluation” by Meiqun Hu et al. takes an author- and review-focused approach to defining article quality. Each word has a unique author and some set of reviewers; the more words in an article that have been written and reviewed by “good” authors, the better the article. The author-based approach is interesting, and given time we may attempt to implement their BASIC algorithm. However, this approach relies on being able to accurately evaluate the credibility of all authors, which raises issues similar to the ones we are trying to address when automating the evaluation of an article.

Finally, “Assessing Information Quality of a Community-Based Encyclopedia” by Stvilia et al. also explored additional quality metrics, formed by combining and weighting multiple article statistics, and tested the effectiveness of their selected metrics. This paper additionally suggests tracking and analyzing article connectivity, something we are interested in exploring per article.

Coordination

Coordination was carried out between subteams and individuals over text and Slack. Team meetings with Dave were held every Tuesday, and meetings without Dave every Thursday. Extra meetings were added as necessary and included whichever subset of the team needed to communicate their progress in person over the weekend.

Subteams for term 1 were Nathan and David on Team 1 and Phil and Louis on Team 2. Team 1 worked on the Python script while Team 2 set up the database in parallel. Once both teams were ready, they came together to integrate the Python script with a server and the database for the second portion of the project.

For term 2, subteams changed so that Nathan and Phil were on Team 1, working on author/article connectivity research, and David and Louis were on Team 2, working on the sentiment analysis implementation. Subteams dissolved after week 6 so that focus could shift to the presentation as a group.

Final Product

As a final product, we have a locally hosted website capable of retrieving a (Simple) Wikipedia article (by ID or disambiguated name) and producing a score summary. Because the evaluations we produce address an article’s quality, it is hard to confirm whether an evaluation is accurate. To verify that our application works, we need to compare its output against evaluations produced by similar, existing applications. This is partially done through the connectivity graphs, which take an author's past success into account based on Wikipedia's own manual grading of articles, but more comparisons could be made to determine the accuracy of the generated scores.

We have a completed database system that stores article scores along with other article and revision metrics. We also have a completed website where the user can specify a (Simple) Wikipedia article and see the metrics we calculated for it. These metrics are displayed in a simple table, alongside a node graph showing article and author connections (a more advanced metric).