New Google Research May Improve Live-Translation of Virtual Meetings


The virtual meetings space has seen a huge uptick in demand in recent months as global coronavirus lockdowns have taken effect and companies strive to maintain business as usual.

Some of the biggest names in tech, and many other niche players, are vying for their share of the evolving enterprise market for virtual meetings and conferences. As a way to differentiate themselves, virtual meeting providers are now focusing more on integrating language technology, such as machine translation (MT) and automatic transcription.

According to Gartner’s Magic Quadrant for Meeting Solutions, published September 5, 2019, 80% of the vendors they profiled “have some version of meeting transcription, much of which is based on natural language processing (NLP) technology with AI.” It also identified “transcription and translation for delivering webinars, town hall meetings and quarterly business reviews” as a key area of focus for meeting solution providers.

Aggressively pursuing language integration is Google, which is stepping up its efforts to challenge rivals Zoom, Microsoft, Cisco, LogMeIn, and others. Google Meet began offering live captions (automatic transcription) in September 2019 and Google’s researchers have now turned their attention to solving the simultaneous translation problem — very relevant for providing live translated captions for audio or video.

Incoming Stream of Source Words

In a paper published in mid-April 2020, Google researchers Naveen Arivazhagan, Colin Cherry, Wolfgang Macherey, and George Foster compared two major approaches to simultaneous translation: streaming and re-translation. As they explained, the goal of simultaneous MT is to “translate an incoming stream of source words with as low latency [delay] as possible.” The research builds on an earlier paper published in late 2019.

With streaming, which has until now been the dominant approach, translated text cannot be changed once it has appeared on screen. Re-translation, by contrast, allows modifications to the translated text even after it is displayed on screen.

Google tested which approach produced better translations (quality) and which led to a shorter delay in translated text appearing on screen (latency).

Using German-to-English and English-to-French language pairs, Google found that re-translation outperformed streaming on both quality and latency. Quality-wise, it produces “high final-translation quality” because translations are iterated on and improved. Latency-wise, it results in less of a delay because it “always attempts a translation of the complete source prefix.”
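The core loop of re-translation can be sketched in a few lines. The snippet below is an illustrative toy, not the paper’s implementation: `retranslate_stream` and `toy_mt` are hypothetical names, and the stand-in “MT system” just upper-cases words. The point it demonstrates is the one the researchers describe: every incoming source word triggers a fresh translation of the complete source prefix, so the displayed caption can be revised as context grows, and any black-box MT system can be plugged in.

```python
# Hypothetical sketch of the re-translation loop. Each time a new source
# word arrives, the complete source prefix is re-translated from scratch,
# and the new output replaces the previous caption on screen.

def retranslate_stream(source_words, translate):
    """Yield a full translation of the growing source prefix.

    `translate` stands in for any black-box MT system -- one strength of
    re-translation noted in the paper is that it works with any such
    system unchanged.
    """
    prefix = []
    for word in source_words:
        prefix.append(word)
        yield translate(prefix)

# Toy stand-in for an MT system (illustration only, not a real model).
# A real system's output could change non-monotonically as context grows.
toy_mt = lambda words: " ".join(w.upper() for w in words)

for caption in retranslate_stream(["guten", "morgen"], toy_mt):
    print(caption)  # each line replaces the previous on-screen caption
```

Swapping `toy_mt` for a production translation model is the whole integration story, which is why the researchers describe the approach as easy to implement.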

Moreover, since on-screen changes to text may be difficult for readers to parse, Google also tried limiting the number of changes allowed to see how this may impact translation quality.

When just one in five translated words were allowed to be changed, re-translation was “as good or better than state-of-the-art streaming systems.” And even when no revisions were allowed, re-translation was “surprisingly competitive.”
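One simple way to quantify such revisions is to count how many already-displayed words must be erased each time a new translation replaces the old one. The sketch below is an assumption-laden illustration in that spirit (the paper’s exact erasure definition may differ); `erased_words` is a hypothetical helper, and the budget of “one in five” words would be enforced by comparing total erased words to total displayed words.

```python
# Illustrative only: count how many displayed words are erased when a new
# caption diverges from the previous one at some position.

def erased_words(previous: str, current: str) -> int:
    """Words of `previous` that must be erased because `current`
    diverges from it at some position."""
    prev, curr = previous.split(), current.split()
    common = 0
    for p, c in zip(prev, curr):
        if p != c:
            break
        common += 1
    return len(prev) - common

# Successive captions for one sentence; the first revision rewrites "a".
captions = ["a", "the cat", "the cat sat"]
total_erased = sum(erased_words(a, b) for a, b in zip(captions, captions[1:]))
print(total_erased)  # one word ("a") was erased across all revisions
```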

Google identified a number of other strengths in re-translation: in addition to allowing changes to the translated text, it works with any MT system and is easy to implement. It also captures improvements in NMT technology and can be tuned to adjust the latency–quality trade-off.

Re-Translation as a Strong Baseline

Overall, the researchers said they see re-translation as a “strong baseline for future research on simultaneous translation,” and suggested further research could explore a “lower-cost solution that preserves the flexibility of re-translation.” 

As for Google’s competition in meeting solutions based on AI transcription, many providers, including Zoom, appear to provide transcriptions after the event based on recorded audio rather than live captions during the event. For real-time transcription, Cisco (Webex) provides “real-time transcription and closed captioning during the meeting, as well as recordings and transcripts after the meeting.”

Gartner did not specifically identify any of the providers they profiled as currently offering simultaneous translation, but they did point out that Huawei already offers “automated translation of meeting transcripts in two other languages.” Skype (owned by Microsoft) does in fact offer a limited live audio translation feature, but it is not available in group calls.

In Reality, Not Quite There Yet

LogMeIn’s Head of Localization, Hartmut von Berg, told Slator: “Unfortunately, we don’t have a live translation built into GoToMeeting yet. That would be awesome but technology isn’t good enough at the moment for speech. Chat would be doable but isn’t implemented either.”

While AI transcription solutions are more or less well implemented, real-time translation and real-time interpreting of meetings are not yet solved problems, although high-profile attempts have been made. Chinese Internet giant Tencent’s debut of its simultaneous interpreting technology backfired spectacularly during a prestigious live event back in 2018, whereas Microsoft’s 2019 showcase of its Japanese-speaking hologram, which relied on text-to-speech technology, got a far better reception.

With Google planning further research into simultaneous translation, it may not be long before the technology is deployed in Google Meet in the form of live translated captions. And simultaneous translation could be applied well beyond virtual meetings, for example in live subtitling for news and media.