The other day, I heard that there had been a fire set in the amazing Sagrada Família church in Barcelona, Spain. While reading an article about the fire on an English-language Spanish news website, I ran across this sentence:
a number fire engines and fire-fighting crews rushed to the scene and took three quarters of an hour to put out the blaze and lucid the smoke.
The uncapitalized first word and the missing "of" in "number fire engines" hinted that this might be an automatic translation of some article originally in Spanish or Catalan, but "put out the blaze and lucid the smoke" was the clincher. Lucidity is always welcome, but in this case, some simple clearing of smoke was probably called for.
As usual, there's a little bit of humor in this error, and a chance to roll one's eyes at the foolishness of computers. And yet, I found this particular article helpful and lucid (sorry). Even an imperfect translation can be useful.
As it happens, there is a well-known and widely used approach to statistical machine translation, described quite clearly in several places. For this comps project, you're going to implement this approach.
The basic algorithms at the heart of statistical MT systems are within the scope of a comps project, but getting a system to be good enough for daily use is likely to be harder. One of the key elements of a statistical MT system is a large collection of parallel translations (that is, documents that say the same things in two or more languages). Fortunately for our purposes, there are large collections of parallel translations available from the United Nations and the European Union. How well will an MT system based on these documents work? That's one of the things I am excited to find out!
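To give a feel for how parallel translations get used, here is a minimal sketch of one classic way to learn word-level translation probabilities from sentence pairs, in the style of IBM Model 1 trained with expectation-maximization. The choice of Model 1 is an assumption made for illustration (this description does not say which algorithms the project will use), the function names `train_word_model` and `best_translation` are hypothetical, and the three-sentence corpus is invented. The sketch also leaves out the NULL source word that fuller versions of the model include.

```python
from collections import defaultdict


def train_word_model(pairs, iterations=20):
    """Estimate t(target_word | source_word) by EM over sentence pairs.

    pairs: list of (source_tokens, target_tokens); both lists non-empty.
    Returns a dict mapping (target_word, source_word) -> probability.
    """
    target_vocab = {w for _, tgt in pairs for w in tgt}
    uniform = 1.0 / len(target_vocab)
    # Start with the same probability for every pair of words that
    # co-occur in some sentence pair; other pairs are never looked up.
    t = {(f, e): uniform for src, tgt in pairs for e in src for f in tgt}

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of f aligned to e
        total = defaultdict(float)  # expected count of e aligned to anything
        for src, tgt in pairs:
            for f in tgt:
                # E-step: share one unit of credit for f among the source
                # words, in proportion to the current probabilities.
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # M-step: renormalize expected counts into probabilities.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


def best_translation(t, source_word):
    """Return the target word with the highest t(target | source_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == source_word}
    return max(candidates, key=candidates.get)


# Invented toy corpus: Spanish source, English target.
toy_pairs = [
    ("la casa".split(), "the house".split()),
    ("la mesa".split(), "the table".split()),
    ("una mesa".split(), "a table".split()),
]
model = train_word_model(toy_pairs)
for word in ["la", "casa", "mesa", "una"]:
    print(word, "->", best_translation(model, word))
```

Nothing in the corpus says which word goes with which; the model works it out only because "la" and "mesa" each appear in two different sentence pairs. That is why the size of the parallel collection matters so much: word-level probabilities like these are only one piece of a full system, and they need far more than three sentences to be useful.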
For this project, you will develop a statistical machine translation system, a mechanism for testing it, and a user interface to make your system easy to play with and test. You will also undoubtedly generate your own collection of amusing translation errors along the way.
Some steps you may wish to follow include:
In the fall, you'll work with a librarian to do a thorough literature search to find out what others have done in this area. In the meantime, here are a few relevant resources.