Carleton Comps Project

Final Results

Topic Modeling of Latin Text Your Health

Advisor: Eric Alexander

Background

In our modern era, we have access to a huge amount of data in the form of text. Much of it is being actively created (e.g., websites, online articles, tweets), but there are also large-scale efforts to digitize historical texts. Such efforts often focus primarily on access--providing the materials to wider audiences through new media--but they also afford new types of analysis of these often-studied documents.

Topic modeling is a suite of algorithms used to extract semantically related sets of words from large text corpora. These sets (“topics”) are groups of words that tend to appear together in the same documents. Once the topics have been extracted, each document can be represented by its proportions of these topics, giving researchers a way of summarizing its contents while also providing a lower-dimensional space that can be used for clustering, classification, and more.
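To make the idea of "topics" and "topic proportions" concrete, here is a minimal sketch of one well-known approach, latent Dirichlet allocation (LDA) fitted with collapsed Gibbs sampling. This page does not prescribe a particular algorithm, so treat this as an illustration only. The function name, the hyperparameter values, and the toy Latin lemmas are all assumptions made for the example; none of the data comes from Bede's text.

```python
import random

def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. `docs` is a list of token lists."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start by assigning every token to a random topic.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic in proportion to
                # (topic's weight in this doc) * (word's weight in the topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic proportions per document (each row sums to 1).
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    # Most frequent words assigned to each topic.
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(num_topics)
    ]
    return theta, top_words

# Invented toy corpus of Latin lemmas (not taken from the Historia).
toy_docs = [
    ["rex", "bellum", "exercitus", "rex"],
    ["episcopus", "ecclesia", "monachus", "ecclesia"],
    ["rex", "exercitus", "bellum"],
    ["monachus", "episcopus", "ecclesia"],
]
theta, top_words = lda_gibbs(toy_docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
for d, row in enumerate(theta):
    print(f"doc {d}: {[round(p, 2) for p in row]}")
```

Each row of `theta` is the low-dimensional representation described above: a document summarized as a mixture over topics rather than as raw word counts.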

Modeling of this kind is becoming increasingly popular across a variety of domains, including the digital humanities. Building topic models of historical data adds its own set of challenges, including greater variation in spelling, inconsistent proper nouns, imperfect digitization, and more. In this project, we will introduce yet another set of challenges: doing it in Latin!

An excerpt of Bede’s Historia ecclesiastica gentis Anglorum (written around 731 AD)

The project

In this project, you will study competing methods of topic modeling and implement a system for generating topic models. While you will test this system on a variety of different datasets, you will be primarily working with History Professor Austin Mason to develop topic models built on the Venerable Bede’s Historia Ecclesiastica Gentis Anglorum. This text, written in a structured format modeled on the Bible and Classical historical texts, covers a vast swath of English history spanning from Caesar’s invasion in 55 BC all the way up to the eighth century. It is one of the most cited and influential historical references of its era. Professor Mason has a highly annotated, digitized, and lemmatized version of the text that he hopes to use to gain different insights into Bede’s goals, biases, and preoccupations than have been achieved by those using more traditional methods.

Part of your investigation will be into the degree to which the metadata that Professor Mason has collected can be incorporated into the modeling process. Does including (or excluding) certain kinds of words affect the quality of the topics being created? On the presentation side, it will be important to create mechanisms for accessing this metadata so that it can inform readers’ analysis.
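As a rough sketch of how metadata might feed into preprocessing, the snippet below filters tokens by annotation before they reach the model. The actual format of Professor Mason's annotations is not described here, so the `lemma` and `pos` fields, the tag names, and the `filter_tokens` helper are all hypothetical.

```python
def filter_tokens(tokens, exclude_pos=frozenset(), exclude_lemmas=frozenset()):
    """Return the lemmas to model, dropping tokens by part of speech or lemma.

    `tokens` is a list of dicts with hypothetical "lemma" and "pos" keys.
    """
    return [
        t["lemma"]
        for t in tokens
        if t["pos"] not in exclude_pos and t["lemma"] not in exclude_lemmas
    ]

# Invented annotated tokens; the tag set is an assumption.
annotated = [
    {"lemma": "Augustinus", "pos": "PROPN"},
    {"lemma": "et", "pos": "CCONJ"},
    {"lemma": "ecclesia", "pos": "NOUN"},
    {"lemma": "aedifico", "pos": "VERB"},
]

# Variant A: keep everything. Variant B: drop proper nouns and conjunctions.
with_all = filter_tokens(annotated)
content_only = filter_tokens(annotated, exclude_pos={"PROPN", "CCONJ"})
print(with_all)
print(content_only)
```

Building one model per variant and comparing the resulting topics is one way to approach the question of whether including or excluding certain kinds of words changes topic quality.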

As part of this project, you will:

Research different methods for creating topic models (there are many), and select a subset of them to implement. You will be building the modeling tools yourselves--this project will not be accomplished with a simple import nltk.
Come to understand some of the oddities of non-standard languages like medieval Latin, so that they can be accounted for in the modeling process.
Investigate and apply different methods of evaluating the models you create. You will use these methods to evaluate models against each other.
Create a tool or mechanism to allow Professor Mason to explore the topics that are extracted, and see their relevance to the text itself.
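For the evaluation step in the list above, one commonly used automatic measure is UMass topic coherence, which scores a topic by how often its top words co-occur in the same documents. It is only one of several possible metrics and is not necessarily the one this project will adopt. The sketch below assumes documents are lists of lemmas; the function name and toy data are invented for illustration.

```python
import math

def umass_coherence(top_words, docs):
    """UMass coherence: sum over word pairs of log((D(w_i, w_j) + 1) / D(w_j)).

    `top_words` should be ordered from most to least probable in the topic;
    D(...) counts the documents that contain all of the given words.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            denom = doc_freq(top_words[j])
            if denom == 0:
                continue  # skip words that never occur in this corpus
            co = doc_freq(top_words[i], top_words[j])
            score += math.log((co + 1) / denom)
    return score

# Invented toy corpus of Latin lemmas.
corpus = [
    ["rex", "bellum"],
    ["rex", "bellum", "exercitus"],
    ["episcopus", "ecclesia"],
]
coherent = umass_coherence(["rex", "bellum"], corpus)
mixed = umass_coherence(["rex", "ecclesia"], corpus)
print(coherent > mixed)
```

Higher (less negative) scores indicate top words that tend to appear together, which gives a simple way to compare two models on the same corpus. Automatic scores like this are usually read alongside human judgment rather than in place of it.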

Recommended experience

Experience with natural language processing would be helpful, but is not required. Other potentially useful courses include Artificial Intelligence, Data Mining, Computational Models of Cognition, linear algebra, and statistics.