Since being open-sourced by Google in November 2018, BERT has had a major impact on natural language processing (NLP) and has been studied as a promising way to further improve neural machine translation (NMT).
An acronym for Bidirectional Encoder Representations from Transformers, BERT is a pre-trained, contextual language model that represents each word based on both its preceding and following context.
Quick recap: NMT basically reads in an input (with an “encoder”) and then tries to predict an output (with a “decoder”). During training, the model is fed input-output pairs and adjusts its parameters to maximize the probability of generating the correct output given the input.
When researchers train a language model, the process is almost the same, except there is no input. “The model just tries to maximize the probability of generating sentences in the target language, without relying on any particular input,” Professor Graham Neubig of Carnegie Mellon University’s Language Technologies Institute explained to Slator.
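In code, the difference between the two objectives comes down to whether there is a source sentence to condition on. The sketch below is purely illustrative; `nmt_model` and `lm_model` are hypothetical models that return per-token logits, not any particular toolkit.

```python
# Illustrative only: `nmt_model` and `lm_model` are hypothetical models
# that return next-token logits; real NMT toolkits differ in the details.
import torch.nn.functional as F

def nmt_loss(nmt_model, src_tokens, tgt_tokens):
    """Supervised NMT: maximize P(output | input) over input-output pairs."""
    logits = nmt_model(src_tokens, tgt_tokens[:, :-1])   # condition on the source
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tgt_tokens[:, 1:].reshape(-1))

def lm_loss(lm_model, tokens):
    """Language modeling: maximize P(sentence) with no input to condition on."""
    logits = lm_model(tokens[:, :-1])                     # predict each next token
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           tokens[:, 1:].reshape(-1))
```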
With BERT, Neubig added, “a model is first trained on only monolingual text data, but in doing so it learns the general trends of that language, and can then be used for downstream tasks.”
In practice, pre-trained BERT models have been shown to significantly improve results on a number of NLP tasks, such as part-of-speech (POS) tagging.
Exactly how BERT manages to outperform other models is not fully understood. As explained in a September 2019 paper by a team at the University of Massachusetts Lowell, one advantage is BERT’s Transformer-based self-attention mechanism, which offers an alternative to recurrent neural networks (RNNs). Researchers are now exploring BERT’s capacity to capture different kinds of linguistic information.
According to SYSTRAN CEO Jean Senellart, using a masked language model like BERT for NLP tasks is relatively simple because BERT is pre-trained using a large amount of data with a lot of implicit information about language.
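To make “masked language model” concrete, the snippet below blanks out one word and asks a pre-trained BERT checkpoint to fill it in using the context on both sides. It assumes the open-source Hugging Face transformers library, which is not mentioned by Senellart, and a generic English checkpoint.

```python
# Minimal masked-word prediction with a pre-trained BERT checkpoint
# (assumes the Hugging Face `transformers` and `torch` packages are installed).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The agreement was [MASK] by both parties.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the masked position and print BERT's best guess for the missing word.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
print(tokenizer.decode([int(logits[0, mask_pos].argmax())]))
```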
“Given that [BERT is] based on a similar approach to neural MT in Transformers, there’s considerable interest and research into how the two can be combined” — John Tinsley, CEO, Iconic Translation Machines
To handle an NLP task, Senellart said, “we take a BERT model, add a simple layer on top of the model, and train this layer to extract the information from the BERT encoding into the actual tags that we are looking for.” This is called “fine-tuning,” and it is used only to extract information that is already known to be present in the encoding.
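As a rough sketch of that recipe, again assuming the Hugging Face transformers library and a hypothetical tag set, the “simple layer on top” can be as small as a single linear projection from BERT’s encodings to task labels.

```python
# Sketch of fine-tuning: a pre-trained BERT encoder plus one new tagging layer.
# `num_tags` and the tagging task itself are hypothetical placeholders.
import torch.nn as nn
from transformers import AutoModel

class BertTagger(nn.Module):
    def __init__(self, num_tags: int, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = AutoModel.from_pretrained(model_name)    # pre-trained encoder
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_tags)

    def forward(self, input_ids, attention_mask):
        # Contextual encoding of every token from pre-trained BERT ...
        hidden = self.bert(input_ids, attention_mask=attention_mask).last_hidden_state
        # ... mapped to task tags (e.g. POS labels) by the newly added layer.
        return self.classifier(hidden)
```

Fine-tuning then trains this added layer on labeled examples so that it learns to read the relevant information out of BERT’s encoding.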
So, what makes BERT relevant now? As John Tinsley, Co-founder and CEO of Iconic Translation Machines, explained to Slator, “Given that [BERT is] based on a similar approach to neural MT in Transformers, there’s considerable interest and research into how the two can be combined.”
Back in from the Cold
Progress in this stream of research represents something of a comeback for language models, which “were an integral and critical part of statistical MT. However, they were not inherent to neural models and, as such, fell by the wayside when neural MT hit the scene,” Tinsley said.
A September 2019 paper by South Korean internet company NAVER concluded that the information encoded by BERT is useful but, on its own, insufficient to perform a translation task. However, it did note that “BERT pre-training allows for a better initialization point for [an] NMT model.”
Other experts who spoke to Slator seem to agree that BERT may be a jumping-off point for more custom-made solutions.
“BERT itself is perhaps not the best fit for pre-training NMT systems, as it does not predict words left-to-right, as most NMT systems do,” Neubig said. “But methods like BERT are already proving quite effective in improving translation results.”
As an example, Neubig cited a May 2019 Microsoft paper on “a new technique for pre-training in NMT that is somewhat inspired by BERT but directly tailored to match the way we do prediction in NMT” as showing “very promising results.”
“The computing and training complexity overhead involved at this point in time make it unlikely to be used for industrial applications in the near term” — Kirti Vashee, Language Technology Evangelist, SDL
However, in terms of bridging the gap between research and commercial use, “the computing and training complexity overhead involved at this point in time make it unlikely to be used for industrial applications in the near term,” said Kirti Vashee, a Language Technology Evangelist with SDL.
“NMT will evolve like SMT, but the NLP research is moving too quickly for people to justify incurring the expenditure in time, training, and computing expense in the near term without very clear evidence that it is worth doing,” Vashee added.
Leveling the Playing Field for Low-Resource Languages
Within NMT, the improvements achieved by BERT have, so far, been seen mostly in low-resource or unsupervised NMT settings, as noted in the September 2019 NAVER paper.
Rohit Gupta, a Senior Scientist at Iconic Translation Machines, predicted that, in the shorter term, BERT is “likely to have a bigger impact on lower resource languages because we can easily get monolingual data.”
Gupta added that pre-training on one language can also positively impact other languages. “For example, we can use English data to improve language modeling for Nepalese,” he said.
Part of the challenge of using pre-trained BERT to train an NMT model, though, is that “the obvious integration does not work well,” Senellart told Slator.
“You can get some improvement for languages with [limited] resources” — Jean Senellart, CEO, SYSTRAN
“You can get some improvement for languages with [limited] resources, but for a language with a lot of resources, what happens is that the encoder — and even more the decoder — loses all its prior knowledge when learning how to translate because it has a lot more obvious features to learn,” Senellart said.
Several recent papers have explored new techniques for the integration, Senellart noted, and “the main idea is to integrate the language knowledge in the encoder and decoder, but as an additional source of information (features) that the encoder or decoder can use.”
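As an illustration of that feature-style integration, and only as an illustration (the papers Senellart refers to each use their own variants), a gate can blend frozen BERT encodings into an NMT encoder’s existing states rather than replacing them.

```python
# Hypothetical gated fusion of BERT features into NMT encoder states.
# Assumes both sequences are already aligned to the same length for simplicity.
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Add BERT encodings as an extra feature source, not a replacement."""
    def __init__(self, nmt_dim: int, bert_dim: int):
        super().__init__()
        self.project = nn.Linear(bert_dim, nmt_dim)       # align feature sizes
        self.gate = nn.Linear(nmt_dim * 2, nmt_dim)

    def forward(self, nmt_states, bert_states):
        bert_feats = self.project(bert_states)
        gate = torch.sigmoid(self.gate(torch.cat([nmt_states, bert_feats], dim=-1)))
        # Per position, the gate decides how much BERT knowledge to mix in,
        # so the translation model keeps the representations it learned itself.
        return gate * nmt_states + (1 - gate) * bert_feats
```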
Samuel Läubli, CTO of Swiss language technology company TextShuttle, believes integrating document-level context will be critical to advancing NMT.
“As long as systems keep focusing on translating sentences in isolation, BERT alone won’t fix the problem” — Samuel Läubli, CTO, TextShuttle
“Ultimately, users are interested in translating whole documents, with consistent terminology and correct references to words in other sentences,” Läubli said. “As long as systems keep focusing on translating sentences in isolation, BERT alone won’t fix the problem.”
Speaking on behalf of TransPerfect, Director of Artificial Intelligence Diego Bartolome told Slator, “We haven’t seen BERT impact our NMT approach yet,” because TransPerfect has already optimized its own tools for its top 40 languages and has enough data to work with in those languages.
However, Bartolome said, BERT “has a role in other solutions we create,” including areas such as “question-answering (chatbots), summarization, and natural language generation, where we clearly see a value.”