Academic research has been foundational to progress in neural machine translation, according to Senior Deep Learning Engineer Chip Huyen. But, as she explained to the audience at SlatorCon San Francisco 2019, there are still discrepancies between research environments and industry realities.
Huyen’s own background is a mix of academia and industry; her résumé includes time at Netflix and experience teaching TensorFlow for Deep Learning Research at Stanford, her alma mater. In 2018, she joined NVIDIA, a company that builds the hardware that brings AI into production. NVIDIA is the inventor of the GPU, or graphics processing unit, which supplies the raw computational power for AI.
Huyen opened her presentation by revisiting the transition from statistical, phrase-based machine translation to today’s neural machine translation (NMT) models and frameworks. She explained that while the output of older models might have been more predictable, it lacked the natural fluency of today’s neural models.
NMT, however, continues to have limitations. It typically requires massive amounts of data, and translation quality tends to degrade as sentences get longer.
Huyen said that one main goal of current research is to decrease the reliance on data. “In research, we work with datasets that are millions of sentence pairs,” she said. “And when we work with our clients, we ask them how much data they have and they say, ‘A lot, like, 10,000 pairs.’ We say, ‘That’s not enough.’”
One method, which Huyen explored in a recent study, is to support training with monolingual data rather than relying only on parallel source-to-target data.
Another option is leveraging similar languages that share common sub-words, and pairing a low-resource language with a high-resource language. In 2016, Google Translate did just that, pairing Azerbaijani, a low-resource language, with Turkish. This pairing improved Google Translate’s work from Azerbaijani into English. On a larger scale, it demonstrated the system’s ability to translate between pairs of languages it had not encountered previously.
Building on this success, the Google AI team published a research paper in July 2019 describing efforts to “[build] a universal neural machine translation (NMT) system capable of translating between any language pair.” The system was trained using over 25 billion examples and is capable of handling 103 languages.
In Reality, No Sentence Is Too Long
Increasing the memory of a neural system can condition it to handle longer sequences, which is important when NMT is used outside of research. “In industry, you can’t say, ‘This sentence is too long and we’re not going to translate it,’” Huyen said.
Huyen cited Transformer-XL as one tool that handles long text by breaking it into shorter segments. As the system processes a text, Transformer-XL reuses hidden states from the previous segment to help with the current one. Using context, rather than sentence representations alone, as another input for the system can also help improve memory, she added.
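The idea of carrying state from one chunk of text into the next can be illustrated with a toy sketch. This is not Transformer-XL itself: the function names, `segment_len`, and `memory_len` are hypothetical, and raw tokens stand in for the hidden states a real model would cache.

```python
def split_into_segments(tokens, segment_len):
    """Cut a long token sequence into fixed-length segments (last one may be shorter)."""
    return [tokens[i:i + segment_len] for i in range(0, len(tokens), segment_len)]


def process_with_memory(tokens, segment_len, memory_len):
    """Return (memory, segment) pairs, where memory holds the tail of what came before.

    Assumption: a real model would cache hidden states, not tokens. Tokens are
    used here only to make the carried-over context visible.
    """
    memory = []
    views = []
    for segment in split_into_segments(tokens, segment_len):
        # The current segment is processed together with the cached context.
        views.append((list(memory), segment))
        # Keep only the most recent `memory_len` items for the next segment.
        combined = memory + segment
        memory = combined[-memory_len:] if memory_len > 0 else []
    return views


tokens = "a b c d e f g".split()
views = process_with_memory(tokens, segment_len=3, memory_len=2)
for memory, segment in views:
    print("context:", memory, "| current:", segment)
```

Each segment after the first sees a short window of preceding context, so no part of a long input has to be translated in isolation.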
Feeling Bleu
As NMT becomes more refined, the need for effective quality evaluation techniques becomes more obvious.
BLEU, perhaps the most familiar method, and ROUGE measure n-gram overlap; that is, how much the translated output overlaps with a reference text. (Another well-known technique, NIST, is a BLEU variant that weights n-grams by how informative they are.)
Although BLEU is still widely used in academia, Huyen pointed out that its reliance on reference text makes it impractical for industry use.
“You need to enumerate all of the possible translations, which is near-impossible,” she said. Compounding the reference text requirement, BLEU does not take semantics into account and does not correlate well with human judgment.
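A minimal sketch of the n-gram overlap underlying BLEU shows why a single reference is limiting. It computes only clipped n-gram precision, not the full BLEU score (which also combines several n-gram orders and applies a brevity penalty). The function names and example sentences are illustrative assumptions, not taken from the talk.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of candidate n-grams that also appear in the reference.

    Each n-gram is credited at most as many times as it occurs in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


reference = "the cat is on the mat".split()
candidate = "the cat sat on the mat".split()
paraphrase = "a feline rests upon the rug".split()

unigram_p = clipped_precision(candidate, reference, 1)
bigram_p = clipped_precision(candidate, reference, 2)
paraphrase_p = clipped_precision(paraphrase, reference, 2)
print(unigram_p, bigram_p, paraphrase_p)
```

The paraphrase shares no bigrams with the reference and scores zero even though a human reader might accept it, which is the gap between overlap metrics and meaning that Huyen described.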
Given the shortcomings of quality evaluation techniques like BLEU, quality estimation aims to predict the quality of machine translation output without using reference texts (e.g., to make it possible to estimate post-editing time). Huyen described quality estimation as “very under-explored in research,” noting that it seems to be “mostly driven from industry and not from academia.”
There are other hurdles as well. It is difficult to convince people of the merits of a new metric, and, at the moment, there is no real way of replicating human judgment, she explained.
Huyen’s own research has included developing a metric to evaluate machine translation output without a reference text. The project, MT Evaluation Without Reference (MEWR), sought to evaluate translations by comparing their style and content to those of source sentences. The resulting fidelity score had a strong correlation with the corresponding BLEU score, and weaker correlations with fluency and human judgment.
Not Yet
Much of the current research on NMT is interrelated. One priority is to improve translations of entire documents. Focusing on the document as a whole may promote the use of new evaluation or estimation techniques, because BLEU tends to focus on quality on a sentence-by-sentence basis.
This could also allow systems to adapt to multiple domains within one dataset. “In research, all of our datasets are really well-defined, so you have datasets on news or sciences or movie dialogues,” Huyen said, “but in real life people have conversations about a variety of topics.”
Lastly, Huyen predicts the future may bring more opportunities for what she calls “hybrid human-machine translation.”
“Wherever I go, people keep asking me if AI is going to replace translators,” Huyen said. Based on the challenges MT still faces, she said, “I guess the answer is not yet.”