Hello! Want to understand SMT?
SMT is not a compilation of translation/grammar rules. That approach is a valid strategy for Machine Translation in its own right, known as "Rule-based MT", but it should not be confused with SMT, and it comes with a few pitfalls:
- Difficult to design and implement
- Requires extensive linguistic knowledge of every language involved in the translation
- Does not scale well
SMT instead makes use of statistical models to achieve our goal. We train these models on existing translations, produced by humans. Thus, SMT is an example of Machine Learning (under the broader category of Artificial Intelligence), in the sense that it "learns" how to translate between languages by studying hand-produced translations of (hopefully) good quality. Our system uses these statistical models to evaluate the probability of sentence translations, with the ultimate goal of finding the "most likely" translation for an input sentence.
Let us define a few terms:
P(E|S) = "the probability of an English sentence E given a Spanish sentence S"
argmax_E(P(E|S))* = "the English sentence E that maximizes the above probability"
*This is our goal for Spanish to English translation
We decompose the problem of determining sentence probability using Bayes' Rule:
P(E|S) = P(S|E) * P(E) / P(S)
This allows us to describe our overall goal as:
argmax_E(P(S|E) * P(E))
Note that the P(S) term has dropped out. Since the input sentence S is fixed, its associated probability P(S) is also fixed for any sentence E, and is therefore irrelevant for the maximization.
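Compactly, and in standard notation, the whole chain of reasoning is:

```latex
\hat{E} \;=\; \operatorname*{arg\,max}_{E} P(E \mid S)
      \;=\; \operatorname*{arg\,max}_{E} \frac{P(S \mid E)\, P(E)}{P(S)}
      \;=\; \operatorname*{arg\,max}_{E} P(S \mid E)\, P(E)
```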
We can now define the components of our system:
- Language Model — P(E)
- Translation Model — P(S|E)
- Decoder — argmax
Language Model
The goal of this component is to determine how good/likely an English sentence is: P(E). You can think of this as the "Englishiness" of a sentence. Ungrammatical English sentences should have low probabilities. Grammatical sentences that convey little or no meaning will do better. Grammatical sentences with a clear meaning will do better yet. Grammatical sentences with a clear meaning, common vocabulary, and natural phrasing will do best of all. At least, that's the idea.
In practice, we use a Bi-gram model, which determines the probability of a sentence by examining all of the adjacent word pairs in that sentence. For example, to evaluate the sentence "Hello, how are you?", it would look like P("hello how") * P("how are") * P("are you"). We're ignoring punctuation and capitalization for simplicity.
In order to evaluate the probability of these two-word phrases, we train on our corpus, which is the collection of known translations that our system uses for learning. Note that while the corpus is represented by a series of paired sentences, we will only need sentences of a single language for this component. The corpus trains these Bi-gram probabilities by simply assessing what proportion of occurrences of the first word are followed by the second word. For example, if "hello" appears 100 times in our corpus, and it is followed by "how" only 5 of those times, P("hello how") will be 5/100 = 0.05.
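To make that concrete, here is a minimal sketch of Bi-gram training and scoring, assuming a toy, pre-tokenized, English-only corpus; all names and data below are illustrative, not our actual implementation:

```python
from collections import defaultdict

def train_bigrams(sentences):
    # Count every word occurrence and every adjacent word pair.
    word_counts = defaultdict(int)
    pair_counts = defaultdict(int)
    for words in sentences:
        for w in words:
            word_counts[w] += 1
        for first, second in zip(words, words[1:]):
            pair_counts[(first, second)] += 1
    return word_counts, pair_counts

def sentence_probability(words, word_counts, pair_counts):
    # Multiply P("first second") over all adjacent pairs, where each factor
    # is the proportion of occurrences of the first word that are followed
    # by the second word.
    prob = 1.0
    for first, second in zip(words, words[1:]):
        if word_counts[first] == 0:
            return 0.0
        prob *= pair_counts[(first, second)] / word_counts[first]
    return prob

corpus = [["hello", "how", "are", "you"], ["how", "are", "things"]]
words, pairs = train_bigrams(corpus)
print(sentence_probability(["hello", "how", "are", "you"], words, pairs))
```

A full system would typically also smooth these counts so that a single unseen pair does not zero out an otherwise reasonable sentence, but that detail is beyond this sketch.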
Translation Model
This component has the more challenging task of determining how likely it is that two sentences are a good translation of each other. Note that this problem is exactly as hard as our original problem (P(E|S)). In fact, it is the same problem, but with the direction switched around. We are evaluating how likely a Spanish sentence is given an English sentence. So as not to confuse ourselves with the overall translation, which is in the other direction, we refer to this probability as the probability of "E generating S".
Just as in the language model, we will break this overall probability down into smaller, more digestible probabilities, and gather those from the corpus. This time, however, we will consider the entire corpus, i.e. both languages instead of just one. The lower-level probabilities involved in generation are these:
- Fertility - the probability of some English word e generating N Spanish words
- Word translation - the probability of e translating to some Spanish word s
- Alignment/distortion - the probability of the ith word in E generating the jth word in S
The training procedure for these probabilities is much more complicated than for the Bi-gram model, so we will not get into the details here. In short, we begin with uniform probabilities for all values, then make a sweep of the corpus, adjusting the probabilities based on what we see. This is an iterative process, which we repeat a number of times until the probabilities converge sufficiently.
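To make the iterative idea concrete, here is a heavily simplified sketch that trains only the word-translation probabilities, in the spirit of IBM Model 1, and omits fertility and alignment/distortion entirely; the tiny parallel corpus and all names are illustrative, not our actual training code:

```python
from collections import defaultdict

def train_word_translation(sentence_pairs, iterations=10):
    spanish_vocab = {s for english, spanish in sentence_pairs for s in spanish}
    english_vocab = {e for english, spanish in sentence_pairs for e in english}
    # Begin with uniform probabilities t(s | e) for every vocabulary pair.
    t = {(s, e): 1.0 / len(spanish_vocab)
         for s in spanish_vocab for e in english_vocab}
    for _ in range(iterations):
        counts = defaultdict(float)   # expected count of e generating s
        totals = defaultdict(float)   # expected count of e generating anything
        # One sweep of the corpus: share the "credit" for each Spanish word
        # among the English words of the paired sentence, in proportion to
        # the current probabilities.
        for english, spanish in sentence_pairs:
            for s in spanish:
                norm = sum(t[(s, e)] for e in english)
                for e in english:
                    share = t[(s, e)] / norm
                    counts[(s, e)] += share
                    totals[e] += share
        # Re-estimate t(s | e) from the expected counts, then repeat.
        for (s, e) in t:
            if totals[e] > 0:
                t[(s, e)] = counts[(s, e)] / totals[e]
    return t

sentence_pairs = [(["the", "house"], ["la", "casa"]),
                  (["the", "book"], ["el", "libro"])]
t = train_word_translation(sentence_pairs)
print(t[("casa", "house")])  # rises above its uniform starting value of 0.25
```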
Decoder
If the previous two components are functioning and have been trained from the corpus, we have the ability to calculate P(S|E) and P(E). This component is responsible for searching English sentences to find the one that maximizes the product of these probabilities. In practice, it is too costly to search all English sentences beyond a very short length, so we will approximate the global optimum.
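As a toy illustration of what is being maximized (and emphatically not how a real decoder searches), one could score a small hand-picked list of candidate English sentences and keep the best one; the two scoring functions here are hypothetical placeholders standing in for the trained models:

```python
def best_translation(spanish, candidates, translation_prob, language_prob):
    # Score each candidate E by P(S | E) * P(E) and keep the argmax.
    return max(candidates,
               key=lambda english: translation_prob(spanish, english) * language_prob(english))

# Hypothetical stand-ins for the trained models, just so the example runs.
def toy_translation_prob(spanish, english):
    return 0.5 if "how" in english else 0.1

def toy_language_prob(english):
    return 0.4 if english == ["hello", "how", "are", "you"] else 0.2

candidates = [["hello", "how", "are", "you"], ["hello", "what", "be", "you"]]
print(best_translation(["hola", "como", "estas"], candidates,
                       toy_translation_prob, toy_language_prob))
```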
Algorithms to solve this search problem tend to be among the most complex and challenging parts of SMT. We used two strategies in order to compare their relative merits.
The Dynamic Programming (DP) decoder constructs a DP table of partial sentences, building up to some target length. It leverages the language model and translation model for its decisions along the way, and to assess the probabilities of the partial sentences.
The Alternating Search (AS) decoder alternates between two easier problems. First, finding a word-to-word alignment between S and E that maximizes probability. Second, finding the most probable E given both S and some alignment. Note that this second problem is almost the same as the DP decoding problem (and will also be solved using DP), but knowing the alignment at the outset simplifies things considerably.
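A bare-bones skeleton of the alternating loop (not our actual implementation) might look like the following, with both subproblem solvers passed in as hypothetical placeholder functions:

```python
def alternating_search(spanish, solve_alignment, solve_english, rounds=5):
    # Start from a naive guess for E, then alternate between the two
    # easier subproblems described above.
    english = list(spanish)
    for _ in range(rounds):
        alignment = solve_alignment(spanish, english)  # step 1: best alignment for the current E
        english = solve_english(spanish, alignment)    # step 2: best E given S and that alignment
    return english

# Trivial placeholders so the skeleton executes; real solvers would consult the models.
identity_alignment = lambda spanish, english: list(range(len(spanish)))
copy_through = lambda spanish, alignment: [spanish[j] for j in alignment]
print(alternating_search(["hola", "mundo"], identity_alignment, copy_through))
```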
If you've made it this far, you are now essentially an expert on SMT. However, if you still want to know more, head over to the Downloads page to find our source code, and a presentation on the subject.