Final Results

Copy That: Detecting Plagiarism in Documents

Advisor: David Liben-Nowell

Background

I was trying to think of a good working definition for "plagiarism", and what came to mind was this: Plagiarism is the wrongful appropriation and purloining and publication of another author's language, thoughts, ideas, or expressions, and the representation of them as one's own original work. Er, actually, that's what Wikipedia says plagiarism is. There have been a number of recent high-profile instances in which someone publishes a famous book or scientific article that turns out to have surprisingly close textual similarity to a previously published (sometimes less famous) book or article. Recent (alleged) examples include How Opal Mehta Got Kissed, Got Wild, and Got a Life [NYT] and a gun control article in Time [NYT]. There's an excellent recent essay from the plagiarizee in a case of poetry: Sandra Beasley's Nice Poem, I'll Take it in the New York Times. And issues of academic integrity on college campuses, including Carleton, often center on plagiarism.

Here's an example, from Sandra Beasley's Nice Poem, I'll Take it in the New York Times:

August, by Sandra Beasley
[in Theories of Falling, New Issues Press, 2008]
[found online at Hayden's Ferry Review]

Sooner or later, the thing you value most will beg to be burned.
Trust me, says the phoenix, I'm immortal. Watch your childhood
home—how the wires fray, how the baseboards splinter to tinder.
Your nights are split open by the steam and the writhing of hoses.

July, "by" Christian Ward
[submitted to the Buxton Poetry Competition, 2011]
[found in Sandra Beasley's NYT essay Nice Poem, I'll Take it]

Sooner or later, whatever you cherish most will beg to be burned.
Trust me, the phoenix says, I'm immortal. Watch your childhood
home—how the wires fray, how the floorboards splinter to tinder.
Your nights are spilt open by steam and the writhing of hoses.

The Project

In this project, you will build a system to detect use of text in a document (the "derived document") that also appears elsewhere in a corpus of documents (the "source documents"). There are several types of repetition that you will explore. In increasing order of complexity, they are:

word-for-word duplication of a (long) sequence of words from a source document, without any intervening additions.
[syntactic modification] duplication of a (long) sequence of words from a source document, with some new material inserted, old material deleted, or old material edited in the derived document.
[paraphrasing] a "semantically similar" version of a (long) sequence of words from a source document, where some words or phrases are replaced by synonyms or near-synonyms.
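
As a rough illustration of the first two categories, here is a minimal sketch using only Python's standard library. The function names (tokenize, shared_ngrams, edit_similarity) and the n-gram length of 5 are assumptions chosen for this example, not part of the project specification. Shared word n-grams stand in for word-for-word duplication, and difflib's sequence-matching ratio stands in for duplication with insertions, deletions, or edits. Paraphrasing needs some source of synonym information and is not attempted here.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def shared_ngrams(source, derived, n=5):
    """Return the word n-grams that appear in both texts (exact duplication)."""
    def ngrams(tokens):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
    return ngrams(tokenize(source)) & ngrams(tokenize(derived))


def edit_similarity(source, derived):
    """Similarity in [0, 1] over word sequences; tolerates small edits."""
    matcher = SequenceMatcher(None, tokenize(source), tokenize(derived))
    return matcher.ratio()


# First lines of the two poems quoted in the Background section.
source = "Sooner or later, the thing you value most will beg to be burned."
derived = "Sooner or later, whatever you cherish most will beg to be burned."

print(len(shared_ngrams(source, derived)))
print(round(edit_similarity(source, derived), 2))
```

The two measures respond differently to the same edit. Swapping a couple of words in the middle of a line breaks every 5-gram that spans the change, while the sequence-matching ratio only drops in proportion to the number of words changed.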

The outline of the project's task will be to (1) investigate existing plagiarism-detection systems; (2) investigate the literature on textual similarity, for example "fingerprinting techniques" in detecting duplicate web pages; (3) investigate algorithms for semantic similarity; (4) implement/adapt/extend these algorithms to the present case. Another form of plagiarism, outside the corpus, comes from the web. A nice extension of this project might be to find "the most unexpected (short) phrases" in a document that might be candidates for a query to an API for a web search engine.
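
To give a flavor of what a "fingerprinting technique" can look like, here is a minimal sketch, assuming one simple variant: hash every run of k consecutive words (a "shingle") and compare two documents by the overlap of their hash sets. The names fingerprints and jaccard, the shingle length k = 3, and the use of a truncated SHA-1 digest are all assumptions made for this illustration. The techniques you find in the literature may differ substantially.

```python
import hashlib
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def fingerprints(text, k=3):
    """Hash each k-word shingle to a 64-bit integer and return the set of hashes."""
    tokens = tokenize(text)
    prints = set()
    for i in range(len(tokens) - k + 1):
        shingle = " ".join(tokens[i:i + k])
        # hashlib gives the same value on every run, unlike built-in hash() on str.
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        prints.add(int.from_bytes(digest[:8], "big"))
    return prints


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


source = "Sooner or later, the thing you value most will beg to be burned."
derived = "Sooner or later, whatever you cherish most will beg to be burned."

print(jaccard(fingerprints(source), fingerprints(derived)))
```

This sketch keeps every hash, which is fine for a pair of poems but costly for a large corpus. Fingerprinting schemes meant for large collections generally retain only a sample of the hashes, and how to choose that sample is one of the questions the literature review should answer.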

You can expect to begin the project by identifying existing algorithmic techniques for detection of duplicated text from the literature. We will spend a significant portion of fall term with you identifying appropriate papers, reading some of them in sufficient detail to be able to explain them, and then reporting the techniques to the team. We will then begin to transition to implementing/extending these algorithms.
