Traditional discovery for civil lawsuits consumes an enormous number of human hours. Reading hundreds of thousands of documents is inefficient and prone to human error. Every document is either relevant or not relevant to a case, but a lawyer or paralegal must read and understand each one to make that determination.
E-Discovery uses machine learning algorithms to group documents and classify them as relevant or not relevant with minimal human input.
After Enron collapsed, the Federal Energy Regulatory Commission gathered a corpus of approximately 600,000 emails from the company. After some investigation, it released the dataset to the public. This dataset has become the gold standard for E-Discovery research, since it contains real internal documents from a company; emails like these are incredibly hard to come by due to privacy concerns. In addition to the dataset itself, we used relevance labels for fictitious scenarios generated in 2011 by the Text Retrieval Conference (TREC) as ground truth against which to compare our results.
Search and filter emails by sender, recipient, date, and subject. Walk through an example scenario we constructed for finding emails about lunch at Enron. Built with Vue.js and Bulma CSS.
Broadly speaking, our pipeline consists of two steps: natural language processing and a random forest classifier. We used Latent Semantic Analysis (LSA) to derive a set of topics from the contents of the Enron emails; LSA also tells us how strongly each topic is represented in each email. These topic weights, combined with metadata about senders and recipients, form the features we feed to the random forest. A random forest is a popular machine learning algorithm that aggregates many decision trees into a single classification decision, which helps guard against overfitting.
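As a rough sketch of that pipeline (the function names, topic count, and hyperparameters below are illustrative assumptions, not our exact configuration), the two steps look roughly like this in scikit-learn:

```python
# Minimal sketch: LSA topic weights as features, then a random forest.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
import numpy as np

def build_features(email_bodies, metadata_features, n_topics=100):
    """Turn raw email text into LSA topic weights and stack on metadata."""
    tfidf = TfidfVectorizer(stop_words="english", max_features=50_000)
    term_doc = tfidf.fit_transform(email_bodies)

    # LSA = truncated SVD on the TF-IDF matrix; each row is an email's
    # weight on each latent topic.
    lsa = TruncatedSVD(n_components=n_topics, random_state=0)
    topic_weights = lsa.fit_transform(term_doc)

    # Combine topic weights with numeric sender/recipient metadata.
    return np.hstack([topic_weights, metadata_features])

# X = build_features(bodies, metadata)   # y = 0/1 relevance labels from TREC
# clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
# predictions = clf.predict(X_new)
```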
Explore how our machine learning pipeline classified each email on the front end. The importance of each topic and the words that make up each topic are readily displayed for every email.
Our results were competitive with those from the TREC conference. For a text-retrieval task like this one, recall (the proportion of relevant documents that we successfully found) is the most important metric. On the scenario we worked with primarily, our results were significantly better than those of the teams that participated in the conference: the team with the best recall at the conference, 96%, had an overall F1 score of 0.17, compared to our F1 of 0.78. However, the conference teams performed better on two other scenarios, where our recall scores were more disappointing. With more time and parameter tuning, we believe we can reach results comparable to the industry standard across all three scenarios.
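To make the recall/F1 relationship concrete, F1 is the harmonic mean of precision and recall, so very low precision drags the score down even when recall is near perfect. The precision figures below are back-solved or purely hypothetical, not reported numbers:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# Precision ~0.09 back-solved from the conference team's reported numbers
# (96% recall, F1 of 0.17); it is an estimate, not a published figure.
print(round(f1(0.093, 0.96), 2))  # ~0.17

# A purely hypothetical precision/recall pair that would yield an F1 near
# our reported 0.78; our actual precision/recall breakdown is not shown here.
print(round(f1(0.75, 0.81), 2))   # ~0.78
```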