From Hype to API: Baidu Productizes Speech-to-Speech Translation

China has been zealous in its pursuit of artificial intelligence (AI) systems in language technology, not least those relating to speech.

In November 2018, Baidu, one of the country’s leading tech companies, took some flak after appearing to hype the capabilities of what it called a “simultaneous translation” system in a research paper titled “STACL: Simultaneous Translation with Implicit Anticipation and Controllable Latency using Prefix-to-Prefix Framework.”

The tech giant claimed to be able to translate “just a few seconds into the speaker’s speech and finish just a few seconds after the speaker finishes, very much like a simultaneous interpreter.”

In less than a year, Baidu has taken the idea from research paper to API, releasing a speech-to-speech “machine simultaneous interpretation service” on August 16, 2019. However, STACL is not the major breakthrough Baidu’s researchers claimed it to be. STACL is based on what its researchers call a wait-k model, where users specify how many words STACL should wait for before it begins translating.

Based on this partial input, STACL attempts to predict the words a speaker may utter next. The longer the wait time, the more accurate STACL’s output — which undercuts the claim that the system is truly simultaneous.
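The wait-k schedule described above can be sketched as a simple read/write loop: read k source words before emitting anything, then alternate one read with one write, and flush the remaining output once the speaker finishes. The sketch below is illustrative only — `translate_step` is a hypothetical stand-in for STACL’s prefix-to-prefix decoder, not Baidu’s actual implementation.

```python
def wait_k_decode(source_tokens, k, translate_step):
    """Illustrative wait-k policy: wait for k source tokens, then
    emit one target token per newly read source token.
    `translate_step` is a hypothetical stand-in for the model's
    prefix-to-prefix decoder."""
    output = []
    for i in range(len(source_tokens)):
        prefix = source_tokens[: i + 1]  # source read so far
        if i + 1 >= k:                   # enough context: write one token
            output.append(translate_step(prefix, output))
    # speaker has finished: flush the remaining target tokens
    while len(output) < len(source_tokens):
        output.append(translate_step(source_tokens, output))
    return output

# Toy demo: the "translation" of each source word is just that word
# echoed back, so the k-word lag in the schedule is easy to see.
demo = wait_k_decode(
    ["w1", "w2", "w3", "w4", "w5"],
    k=2,
    translate_step=lambda prefix, out: prefix[len(out)],
)
print(demo)  # → ['w1', 'w2', 'w3', 'w4', 'w5']
```

A larger k gives the decoder more context per emitted word (better accuracy) at the cost of a longer lag behind the speaker — the trade-off the “controllable latency” in the paper’s title refers to.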

On the Baidu product page, it is not immediately clear if the product, which is available for Chinese and English, is indeed based on STACL, though some of the terminology is similar (“low latency” and “predictive modeling based on semantic units information transfer and speaker synchronization” as translated by Google Translate). There has also been no further news on STACL, so there is no telling if the model has been developed further.

In other Baidu language tech news, the Internet giant released a pre-trained language model, ERNIE 2.0, based on what it calls “continual pre-training.” In a paper published on July 29, 2019, Baidu claims that ERNIE outperforms Google’s BERT and XLNet on 16 tasks, including English tasks on the GLUE benchmark and several common tasks in Chinese.

Of course, Baidu is not alone among big tech in overselling its translation technology capabilities. In a July 2019 demo, for example, Microsoft hyped its mixed reality, AI, and translation technologies, creating the impression that it could have a hologram of a speaker deliver a Japanese presentation generated from English input, without a human in the loop.

At the time, we contacted Microsoft to ask whether the raw machine translation output of Azure (Microsoft) Translate had been edited by a human before being used in the demo. Microsoft said it had no comment at the time.