The Art of Hyping Machine Translation

Baidu is China’s top search engine, one of the country’s leading proponents of artificial intelligence, and an excellent study in public relations handiwork.

In mid-October 2018, Baidu’s research team published a paper titled “STACL: Simultaneous Translation with Integrated Anticipation and Controllable Latency.” The publication came in the run-up to the 2018 edition of the Conference on Empirical Methods in Natural Language Processing, when thousands of the world’s leading researchers in machine translation and natural language processing descended on Brussels to share their latest findings.

At the time, the Baidu paper seemed like just another incremental development in neural machine translation (NMT) research. But then Baidu’s PR machinery began churning.

Baidu’s PR agency sent press releases, the research paper itself, and a GitHub demo page to numerous media outlets for coverage—Slator included. The agency wrote that STACL “can translate just a few seconds into the speaker’s speech and finish just a few seconds after the speaker finishes, very much like a simultaneous interpreter.” Within hours, widespread media coverage ensued:

  • “Linguists, update your resumes because Baidu thinks it has cracked fast AI translation,” by The Register
  • “Baidu’s AI Can Do Simultaneous Translation Between Any Two Languages,” by IEEE Spectrum
  • “Baidu’s Chinese-to-English translator finishes your sentence for you,” from MIT Technology Review
  • “Baidu Looks to Amaze With A.I. Translation in Real Time,” from
  • “Baidu creates the world’s first simultaneous translation system,” from SiliconAngle
  • “China’s Baidu challenges Google with A.I. that translates languages in real-time,” by CNBC
  • “Baidu to debut simultaneous machine translation in latest challenge to Google,” from SCMP

Other headlines were more straightforward yet still took the well-spun PR message at face value:

  • “Baidu launches simultaneous language translation AI,” from VentureBeat
  • “Baidu develops its own take on real-time translation (Google has some fresh competition),” from Engadget
  • “Baidu releases a new AI translation system, STACL, that can do simultaneous interpretation,” by Packt Hub

Links to the media coverage the paper garnered, including logos of the outlets, were displayed on the Baidu Research group’s GitHub demo page.

Because these articles and headlines remain uncorrected, other media outlets have begun to pick them up. Case in point: Axios linked to the IEEE Spectrum article in a piece on the dominance of English in scientific discourse.

As part of the PR push, Baidu conducted a public demo of STACL at its annual Baidu World Congress on November 1. During the event, two screens on either side of the main display showed automatic speech recognition output and STACL’s simultaneous translation at the same time.

Baidu STACL Demo: Two screens show automatic speech recognition output on one and STACL simultaneous translation on the other

However, the simultaneous interpretation available on the live stream was still provided by human interpreters.

Unpacking STACL

So what is STACL and does the research merit such coverage? Compared to a regular NMT engine that “processes” an entire sentence and then translates it, STACL is “fed” the source sentence word by word, and begins translating the input once it reaches a specified number of words. Baidu’s research team called it a wait-k model, where users can specify how many words STACL should wait for before beginning translation.

STACL translates the partial input as it goes and attempts to predict words that may only appear towards the end of a source sentence. For example, in languages like German where important parts of speech such as verbs and negation typically appear at the end of the sentence, STACL will need to guess the verb or negation as it translates a German source sentence into, say, English. STACL’s prediction method also uses context from the language data on which it is trained.
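As a rough illustration of the wait-k idea described above, the sketch below computes how many source words are visible when each target word is produced. The function names, the 0-indexed schedule `min(k + t, source length)`, and the German example sentence are assumptions made for illustration. This is not Baidu’s implementation, and no translation model is involved.

```python
def wait_k_schedule(source_len: int, target_len: int, k: int) -> list[int]:
    """Number of source words visible when target word t (0-indexed) is written.

    Assumed policy: read k source words first, then alternate one write
    and one read until the source runs out.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return [min(k + t, source_len) for t in range(target_len)]


def visible_prefixes(source_words: list[str], target_len: int, k: int) -> list[list[str]]:
    """Source prefix available at each target step under the schedule above."""
    schedule = wait_k_schedule(len(source_words), target_len, k)
    return [source_words[:n] for n in schedule]


if __name__ == "__main__":
    # Hypothetical German input in which negation and verb arrive last.
    source = "ich habe den Film nicht gesehen".split()
    for step, prefix in enumerate(visible_prefixes(source, target_len=6, k=3)):
        print(step, " ".join(prefix))
```

In this toy run with k set to 3, the first target word is produced when only `ich habe den` has been read. Whether a late-arriving word such as the verb is already visible by the time its translation is due depends on k and on the word order of the two languages.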

The researchers conducted experiments on English-to-German and Chinese-to-English translation tasks.

The catch? With a wait-5 model (the system waits five words before beginning translation), STACL’s output quality is slightly below the current state-of-the-art. Go lower, such as a wait-3 model (the system waits three words), and the predicted words can be completely wrong.

Indeed, the research paper points out that the more words the system has to wait for, the better the output becomes, so it is essentially a balancing act between how simultaneous the translation is and how fluent it becomes.

Time to Update Interpreter Resumes?

Baidu’s press release said the research was “inspired by human simultaneous interpreters, who routinely anticipate or predict materials that the speaker is about to cover in a few seconds into the future.”

In the research paper’s introduction, the endgame is made apparent: “Simultaneous translation has the potential to automate simultaneous interpretation.”

So should interpreters heed The Register’s words and update their resumes? Nope. Slator reached out to NMT experts to get their take on STACL:

“Firstly, I think we still need to be careful not to classify every new piece of research as a breakthrough,” said Dr. John Tinsley, Co-Founder and CEO of Iconic Translation Machines. He said simultaneous translation “has been around for a while, and there have been a few implementations in the context of NMT already.”

Tinsley noted that exploring this technology particularly for interpretation is “enticing,” adding that “it’s positive to have good research teams working in this area and taking steps forward. That being said, I think that’s what we’re looking at here—steps forward.” He also informed Slator that he and his team will be “digging a little deeper under the hood” of the STACL paper in the future.

A Solution in Search of a Problem

Professor Andy Way, Full Professor at Dublin City University and Deputy Director of the ADAPT Centre for Digital Content Technology, said: “It looks to me like this is a solution in search of a problem.”

“If this is not a tool to support interpreters, but instead is intended to replace them, then I think you know based on my track record what I would say about that,” he added.

Prof. Way said simultaneous interpretation is one of the hardest tasks in the business, and human interpreters are “great at what they do.” With that in mind, he noted that STACL’s lower BLEU scores raise deployment issues.

“The situations in which interpreters are deployed are usually places where critical decisions need to be made. Would you trust a machine in such a situation?” — Professor Andy Way, Full Professor at Dublin City University

“Not a Scientific Breakthrough”

Dr. Jean Senellart, Global CTO of Systran, provided Slator with an impromptu, brief version of the review he said he would give if the paper were submitted to a conference. He noted that, according to Chinese colleagues he had conferred with, the announcement made a lot of noise and seemed to be presented “as being equivalent in importance to the paper of Microsoft [on achieving human parity].”

Dr. Senellart’s first point was that the presentation was misleading. In an actual consecutive interpreting scenario, the interpreter may be translating after every full sentence, but he or she is also part of the communication flow, to the point that the speaker may pause to allow the interpreter to ask questions or seek clarification before translating.

“In this workflow, we are indeed doubling the time needed for the communication,” he said.

In the case of MT, Senellart clarified that waiting a certain number of words before translating, as STACL does, is not necessary. “The machine can listen and compute simultaneously, there would not be a need to pause the communication so both would be qualified as an additive latency,” he said.

Senellart said that on a 30 minute speech, an NMT engine that translated after each full sentence and STACL “would only differ by a few seconds.” He added that those few seconds “would not justify a loss of quality of several BLEU points.”

Ordering Food

Slator also reached out to two of the authors of Baidu’s STACL paper: Mingbo Ma and Liang Huang. Huang addressed the above concern, conceding that it is “true only for speech-to-text translation in a conference setting.” However, he offered what he called “a more common scenario: 1-on-1 dialog (e.g., ordering food).”

“Here this full-sentence translation strategy would cause a 2x delay,” he said.

“While our current technology is still speech-to-text, we are also working on expanding it to simultaneous speech-to-speech translation.” — Liang Huang, Principal Scientist, Baidu Silicon Valley AI Lab

In his impromptu review, Senellart continued that “the use case is wrongly presented and wants to drive the reader to some understanding that the machine has reached a new ‘superhuman ability’.”

To this point, Huang simply stated: “We never claimed ‘superhuman ability’.”

Senellart also addressed a claim in the paper where the authors said STACL’s overhead is “much more desirable” because it was additive instead of multiplicative like the overhead of an NMT engine that translates after every full sentence.

“It is wrong,” he said. “Because even if this statement is true (consecutive vs. simultaneous in the human world), the method they are trying to replace (waiting to the end of the sentence) is falling in the same category than their presented method (both are additive).”
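Senellart’s additive-versus-multiplicative distinction can be made concrete with a toy calculation. The sketch below is an assumption-laden model, not anything from the paper: the function names are hypothetical, the lag values are invented for illustration, and the consecutive case simply assumes each rendition takes as long as the original speech.

```python
def consecutive_total(speech_secs: float) -> float:
    """Multiplicative overhead: the speaker pauses while each sentence is
    rendered, assumed here to take as long as the original (about 2x)."""
    return 2.0 * speech_secs


def streaming_total(speech_secs: float, lag_secs: float) -> float:
    """Additive overhead: output keeps pace with the speaker and finishes
    lag_secs after the speaker stops, regardless of speech length."""
    return speech_secs + lag_secs


if __name__ == "__main__":
    speech = 30 * 60.0        # a 30-minute speech, in seconds
    full_sentence_lag = 8.0   # hypothetical lag when waiting for sentence end
    wait_k_lag = 3.0          # hypothetical lag when waiting for k words

    print("consecutive:  ", consecutive_total(speech))
    print("full sentence:", streaming_total(speech, full_sentence_lag))
    print("wait-k:       ", streaming_total(speech, wait_k_lag))
```

Under these assumed numbers, the two streaming variants finish within seconds of each other, while the consecutive case doubles the total time. Huang’s 1-on-1 dialog scenario corresponds to the consecutive case.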

As for the science behind it, Senellart said “there is no new idea” behind STACL: “the solution they propose would be the default solution anyone who did not read the paper would set up.” He clarified that this is not necessarily bad, “but this is not a scientific breakthrough.” Indeed, simultaneous translation has been looked at before, prior to NMT, and there have been proposals for incorporating it into an NMT engine as early as 2016.

Small BLEU Penalty; Huge Translation Error

Finally, Senellart noticed that there was “almost no qualitative or quantitative evaluation [performed], and this absence most likely covers for big errors.” He pointed out that in one of the examples given in the research paper and the GitHub demo page, STACL correctly predicted and translated “to meet.”

“But here what if the final verb is not meet, but “call”, or something totally different?” Senellart said. “The change of the verb will be only a small BLEU penalty but a huge translation error—how does the model try to recover (if it does) when it finally encodes the verb and “realizes” that the full sentence is wrong?”
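The point about a wrong verb costing little in an overlap-based metric can be illustrated with a deliberately simplified stand-in for BLEU. The sketch below computes only clipped unigram precision on a single sentence pair. Real BLEU combines several n-gram orders with a brevity penalty, so the numbers here are indicative only. The function name and the example sentences are hypothetical.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis: str, reference: str) -> float:
    """Share of hypothesis words that also occur in the reference,
    with each word's matches capped at its count in the reference."""
    hyp = Counter(hypothesis.split())
    ref = Counter(reference.split())
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum((hyp & ref).values())  # Counter & keeps the minimum counts
    return matched / total


if __name__ == "__main__":
    # Hypothetical sentences: only the verb differs.
    reference = "the president plans to meet the minister tomorrow"
    wrong_verb = "the president plans to call the minister tomorrow"
    print(clipped_unigram_precision(reference, reference))
    print(clipped_unigram_precision(wrong_verb, reference))
```

Seven of the eight words still match in the second case, so the score drops only slightly even though the meaning of the sentence has changed.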

Huang confirmed that they did only use BLEU, but also said that “we predict (or ‘anticipate’) all kinds of words, not just verbs, and if there is a major problem in the prediction, it should reflect in BLEU score as well.” He added that they intend to incorporate human evaluations on prediction accuracy.

“We also acknowledge that (a) our current work can’t fix a previous prediction mistake, and (b) it can’t even ‘detect’ such a mistake,” he said. “But we are working on ways to address these problems.”

Perhaps anticipating a bit of pushback from the machine translation community—with Microsoft’s controversial “human parity” claim a cautionary tale—Baidu did add a closer to their press release stating that “STACL is not intended to replace human interpreters, who will continue to be depended upon for their professional services for many years to come.”

According to Baidu, it was merely meant to make simultaneous translation “more accessible.”

Editor’s note: This article has been updated to correct a minor issue depicting STACL’s demo during the Baidu World Congress live stream on November 1.
