Advisor: Eric Alexander
Companies and other entities involved in litigation are often required by the courts to produce all documents relevant to the case as evidence. What is “relevant” can be rather difficult to determine, as requests may be imprecise and a document’s contents may be hard to quantify. A company does not want to present any documents it does not have to, and there may be some documents it legally cannot present (e.g., to maintain the privacy of its customers or employees). However, there can be serious consequences for withholding documents later deemed to be applicable to the case.
Sifting through massive sets of electronic documents and determining which ones make the cut is the focus of the multibillion-dollar industry of electronic discovery (or “e-discovery”). This work has often been performed by humans (expensive ones, with advanced degrees!) painstakingly scanning through document metadata and contents for signs of relevance. Keyword search and other filtering mechanisms make the job easier by trimming the set of potential matches, but it can still be a difficult and error-prone process.
Better (and more interesting for us!) is the process of technology-assisted review (TAR), which seeks to replace the human process with one of supervised machine learning. After specifying some initial constraints to give the algorithm a starting point, reviewers using such techniques are able to iteratively provide the algorithm with labels (e.g., “this document is relevant,” “this one isn’t”) so as to train it to identify relevant documents on its own. Such techniques can save a firm both time and money, and generally result in more accurate final assessments.
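To make the iterative labeling idea concrete, here is a minimal sketch of one possible review loop, written in plain Python. It is not the method used by any particular TAR product: the scoring model (smoothed per-word log-odds), the "review the highest-scoring documents next" strategy, the toy documents, and every name in it (`train`, `score`, `review_loop`, `ask_reviewer`) are assumptions made purely for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train(labeled):
    """Fit per-word log-odds weights from (text, is_relevant) pairs.

    Uses add-one smoothing on document frequencies. This is a stand-in for
    whatever supervised learner you choose to study, not a recommendation.
    """
    relevant, irrelevant = Counter(), Counter()
    for text, is_relevant in labeled:
        (relevant if is_relevant else irrelevant).update(set(tokenize(text)))
    n_rel = sum(1 for _, is_relevant in labeled if is_relevant)
    n_irr = len(labeled) - n_rel
    vocabulary = set(relevant) | set(irrelevant)
    return {
        word: math.log((relevant[word] + 1) / (n_rel + 2))
        - math.log((irrelevant[word] + 1) / (n_irr + 2))
        for word in vocabulary
    }


def score(weights, text):
    """Sum the weights of the distinct words in a document; unseen words count as 0."""
    return sum(weights.get(word, 0.0) for word in set(tokenize(text)))


def review_loop(documents, seed_labels, ask_reviewer, rounds, batch_size):
    """Retrain, rank the unlabeled documents, and ask the reviewer about the top few.

    seed_labels maps document index -> bool and plays the role of the
    initial constraints; ask_reviewer(index) stands in for the human.
    """
    labeled = dict(seed_labels)
    for _ in range(rounds):
        weights = train([(documents[i], y) for i, y in labeled.items()])
        unlabeled = [i for i in range(len(documents)) if i not in labeled]
        if not unlabeled:
            break
        unlabeled.sort(key=lambda i: score(weights, documents[i]), reverse=True)
        for i in unlabeled[:batch_size]:
            labeled[i] = ask_reviewer(i)
    return labeled


# Hypothetical toy collection; a real run would load the Enron documents instead.
documents = [
    "energy trading contract price",
    "lunch menu friday",
    "trading desk price forecast",
    "parking garage notice",
    "contract energy forecast",
    "friday softball game",
]
truth = [True, False, True, False, True, False]  # what the reviewer would say

labels_after_one_round = review_loop(
    documents,
    seed_labels={0: True, 1: False},
    ask_reviewer=lambda i: truth[i],
    rounds=1,
    batch_size=2,
)
print(sorted(labels_after_one_round))
```

In this toy run the reviewer starts with one relevant and one irrelevant example, and the first round sends the two documents that share vocabulary with the relevant seed to the reviewer ahead of the rest. Choosing which documents to show the reviewer next is itself one of the design decisions that distinguishes competing TAR methods.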
In this project, you will investigate some of the competing methods that are used for TAR in e-discovery and compare their performance on a sample dataset made up of documents and emails retrieved from Enron as part of the investigation of its famous scandal.
In particular, you will:
This will involve an extensive look into what kinds of algorithms have been successful at this task. Apart from implementing the algorithms themselves, you will also need to extract features from the documents in question that the algorithms can use to determine their relevance. There are many different types of features that may be useful or important, ranging from keywords and phrases (“n-grams”) to more context-aware features that can be extracted using natural language processing. Ultimately, it will be important to create mechanisms for evaluating your results so that different methods can be compared against one another.
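As a rough illustration of the two ends of that pipeline, the sketch below counts word n-grams as features and computes precision, recall, and F1 as one possible way to compare methods. The function names (`ngrams`, `ngram_features`, `precision_recall_f1`), the whitespace tokenizer, and the choice of these particular metrics are assumptions for illustration; the project may well call for richer features and other measures.

```python
from collections import Counter


def ngrams(text, n):
    """Return the word n-grams of a text, in order, as space-joined strings."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text, max_n=2):
    """Count every n-gram for n = 1..max_n; the counts serve as a feature vector."""
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(text, n))
    return counts


def precision_recall_f1(predicted, actual):
    """Compare predicted relevance labels against reviewer labels.

    Both arguments are equal-length sequences of booleans. Each metric is
    defined as 0.0 when its denominator would be zero.
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


print(ngrams("the energy trading desk", 2))
print(precision_recall_f1([True, True, False, False], [True, False, True, False]))
```

Recall tends to matter a great deal in this setting, since a missed relevant document is one that was wrongly withheld, while precision reflects how many documents were produced unnecessarily.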
Experience with machine learning or related algorithms (such as might be gained in courses starting with CS32X) may be useful, as might experience working with large datasets, but neither is required.