Is Machine Translation Solved?
The authors use the Transformer as their base model. It does not suffer from the 'bottleneck problem' of sequence-to-sequence models, and Transformers are inherently parallelizable compared to RNNs, which allows faster training on larger amounts of data. They build on the idea of back translation, highlighting the dual nature of the problem: back translation augments the bilingual (labeled) data with the monolingual data that is available. The joint training objective, equations 7 and 8 of the paper, appeared to me as the most significant contribution. Joint training of S2T and T2S implies we begin with two pre-trained models, one for each direction. Similarly, the L2R and R2L models are different chain decompositions of the same translation probability, so ideally they should agree, but trained naively they do not; hence they train on a loss that minimizes this disagreement. Deliberation networks, a two-pass process in which a second decoder refines the translation produced by the first, were added to improve on the exposure bias problem; alternatively, agreement regularization adds two KL divergence terms between the models (parameters) learned from either direction. Approximating the gradient of the KL divergence by sampling is a good technical contribution. For the optimal combination of the different systems involved, they used batch MIRA. Lastly, they took great care to avoid noise in data selection: they used an encoder to obtain bilingual representations of the data and then selected the relevant data.
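As a rough illustration of what such an agreement term looks like, here is a minimal PyTorch sketch (my own, not the paper's code); names like `l2r_logits` and `lam` are hypothetical, and the actual system estimates the KL and its gradient on sampled translations rather than on pre-aligned logits.

```python
# Minimal sketch (not the paper's implementation) of L2R/R2L agreement
# regularization: a KL penalty between the two decoders' token distributions,
# added to the usual cross-entropy losses of both directions.
import torch
import torch.nn.functional as F

def agreement_kl(l2r_logits, r2l_logits):
    """KL(p_l2r || p_r2l), averaged over the tokens of a translation.
    Both tensors have shape (batch, seq_len, vocab) and are assumed to be
    aligned token-by-token (R2L outputs already reversed to L2R order)."""
    log_p_l2r = F.log_softmax(l2r_logits, dim=-1)
    log_p_r2l = F.log_softmax(r2l_logits, dim=-1)
    # KL(p || q) = sum_y p(y) * (log p(y) - log q(y))
    kl = (log_p_l2r.exp() * (log_p_l2r - log_p_r2l)).sum(-1)
    return kl.mean()

def joint_loss(ce_l2r, ce_r2l, l2r_logits, r2l_logits, lam=1.0):
    """Cross-entropy of both directions plus the symmetric agreement penalty."""
    agree = agreement_kl(l2r_logits, r2l_logits) + agreement_kl(r2l_logits, l2r_logits)
    return ce_l2r + ce_r2l + lam * agree
```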
The system is evaluated using human annotators, who use a slider to rate the quality of a translation; there is no reference sentence to bias them. I liked how they guard against confounds, annotator errors, and annotator inconsistency: they added redundancy at every level and averaged the results so that no luck is involved. However, they do not report performance specifically on long sentences, which they claim to handle better because of their L2R and R2L agreement regularization approach.
Longer sentences are more likely to suffer from exposure bias, so any improvement in this area has to be demonstrated on increasingly long sentences. The maximum sentence length used in training is 70 words.
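A minimal sketch of the evaluation I would like to see, bucketing corpus BLEU by source length using the sacrebleu package; `srcs`, `hyps`, `refs` and the bucket boundaries are hypothetical placeholders, not anything from the paper.

```python
# Sketch: report corpus BLEU per source-length bucket, so that gains (or
# losses) on long sentences become visible instead of being averaged away.
from collections import defaultdict
import sacrebleu

def bleu_by_length(srcs, hyps, refs, buckets=(10, 20, 40, 70)):
    """srcs/hyps/refs are parallel lists of sentence strings."""
    groups = defaultdict(lambda: ([], []))
    for src, hyp, ref in zip(srcs, hyps, refs):
        n = len(src.split())
        # first bucket boundary the source length fits under, else an overflow key
        key = next((b for b in buckets if n <= b), f">{buckets[-1]}")
        groups[key][0].append(hyp)
        groups[key][1].append(ref)
    return {k: sacrebleu.corpus_bleu(h, [r]).score for k, (h, r) in groups.items()}
```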
Human parity is a big goal, and there is still a long way to go. All of their assessment is based on sentence-level selection rather than document-level text. Document-level translation demands better coherence, consistent use of words and terminology, and flow of language; sentence-level translation would also miss humorous writing or satire. They do not report on pronominal anaphora, nor on performance across a variety of data, such as dialogue from movies, news articles, or works of fiction. Some languages encode levels of respect (Hindi, French), which you may not be able to capture without awareness of the discourse. Also, a human can understand Chinese written in Latin script, but this system removed all Chinese-side text that contained no Chinese characters.
This paper is good progress, but I don't think the problem is solved. Textual cohesion and coherence are not evaluated here. The BLEU score used (or sacreBLEU) has problems capturing nuanced differences in meaning and syntactic order, so chasing BLEU may not be a good way to reach human-level parity; we need to move towards discourse-aware metrics.
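As a small toy illustration of why n-gram overlap misses meaning (my own example, not from the paper): a hypothesis that negates the reference still shares most of its n-grams, so sacreBLEU scores it fairly high.

```python
# Sketch: BLEU only counts n-gram overlap, so a hypothesis that flips the
# meaning of the reference can still overlap heavily with it.
# Requires: pip install sacrebleu
import sacrebleu

refs = [["The parliament approved the new budget on Tuesday."]]
hyp_faithful = ["The parliament approved the new budget on Tuesday."]
hyp_negated  = ["The parliament never approved the new budget on Tuesday."]

print(sacrebleu.corpus_bleu(hyp_faithful, refs).score)  # perfect overlap, BLEU = 100
print(sacrebleu.corpus_bleu(hyp_negated, refs).score)   # still fairly high despite the opposite meaning
```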
The Transformer has 6 encoder and 6 decoder units. Each encoder unit is a stack of self-attention plus a feed-forward network. Self-attention lets the model look at the other words while one word is being encoded, which helps it work out what each word refers to. Take the sentence "Sally is going there. She likes New York.": self-attention helps the model understand that "She" refers to Sally.
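A minimal sketch of single-head scaled dot-product self-attention, with illustrative dimensions rather than the paper's exact configuration:

```python
# Minimal sketch of single-head scaled dot-product self-attention.
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projection matrices.
    Every position attends to every other position, which is how a token such
    as a pronoun can pick up information about its antecedent."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5      # (seq_len, seq_len) similarities
    weights = F.softmax(scores, dim=-1)        # attention distribution per token
    return weights @ v                         # context-mixed representations

d_model, d_k, seq_len = 512, 64, 6             # e.g. the "Sally is going there ..." tokens
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # shape (6, 64)
```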
Each decoder unit is a stack of self-attention, encoder-decoder attention, and a feed-forward network. The encoder-decoder attention helps the decoder focus on particular parts of the encoding while decoding (i.e., it helps utilize the available context).
Why do we have multi-head attention? You start with multiple differently initialized weight matrices, which ensures the model does not fixate on one particular encoding: you have 8 different heads looking at the sequence.
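A sketch of the multi-head idea, reusing the `self_attention` helper and the `x`, `d_model`, `d_k` variables from the single-head sketch above; the 8 parallel heads and the final concatenation are the standard Transformer recipe, while the variable names are mine.

```python
# Sketch of multi-head attention: 8 independent sets of Q/K/V projections
# ("8 different instances to look at"), run in parallel and concatenated.
import torch

def multi_head_attention(x, heads):
    """heads: list of (w_q, w_k, w_v) triples, one per head."""
    outputs = [self_attention(x, w_q, w_k, w_v) for w_q, w_k, w_v in heads]
    return torch.cat(outputs, dim=-1)          # (seq_len, num_heads * d_k)

num_heads = 8
heads = [tuple(torch.randn(d_model, d_k) for _ in range(3)) for _ in range(num_heads)]
z = multi_head_attention(x, heads)             # (6, 512); usually followed by an output projection
```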
There is a lot of focus on exposure bias (remember we are not using an RNN; we are using Transformers). The RNN, and equally the Transformer decoder, models the data via a fully-observed directed graphical model: it decomposes the distribution over the discrete time sequence y1, y2, ..., yT into an ordered product of conditional distributions over tokens. Conditioning each step on the ground-truth prefix during training is called teacher forcing; at generation time the model must instead condition on its own predictions, and that mismatch is the exposure bias. The problem is serious when the sequence is very long (so we would need really long sentences to truly judge whether the KL divergence term has solved this issue). This is where the left-to-right and right-to-left agreement comes in. If we look at the experiments which they conducted for this <><><><>
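To make the teacher-forcing / exposure-bias mismatch concrete, a schematic sketch (mine, not from the paper or from Lamb et al.); `decoder_step` is a hypothetical function returning a next-token distribution given a prefix, and the only point is which prefix it sees.

```python
# Schematic sketch of the teacher-forcing vs. free-running mismatch.
import torch

def teacher_forced_loss(decoder_step, gold_tokens):
    """Training: the model always conditions on the GOLD prefix y_<t."""
    loss = 0.0
    for t in range(1, len(gold_tokens)):
        probs = decoder_step(gold_tokens[:t])   # gold history
        loss -= torch.log(probs[gold_tokens[t]])
    return loss

def free_running_decode(decoder_step, bos, max_len):
    """Generation: the model conditions on its OWN previous predictions,
    so early mistakes compound; the longer the sequence, the worse the
    mismatch with the gold-prefix training distribution (exposure bias)."""
    tokens = [bos]
    for _ in range(max_len):
        probs = decoder_step(tokens)            # model's own history
        tokens.append(int(torch.argmax(probs)))
    return tokens
```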
References
1. http://jalammar.github.io/illustrated-transformer/
2. https://arxiv.org/pdf/1808.04064.pdf
3. Professor Forcing, Lamb et al., https://arxiv.org/pdf/1610.09038.pdf