Zoom Bolts on Speech Translation in What Is Only Its Second-Ever Acquisition

Zoom Buys Real-Time Video Translation Startup Kites

On June 29, 2021, video-conferencing juggernaut Zoom announced on its blog that it was acquiring German simultaneous speech translation provider Karlsruhe Information Technology Solutions (kites GmbH or, as Zoom now spells it, “Kites”). Karlsruhe is a city in southwestern Germany.

What makes the deal remarkable is that Kites is only Zoom’s second acquisition ever, after chat and file encryption provider Keybase in May 2020.

Zoom’s short M&A track record is, naturally, not attributable to a lack of resources. In addition to its USD 100bn-plus market cap, Zoom sits on a USD 4.2bn cash pile, which CFO Kelly Steckelberg said in March 2021 was going to be used for acquisitions: “There are a lot of innovative companies out there that might be the right match for us. We haven’t found the right one.” Well, it seems now they have: in machine translation.

Founded in 2015 by Alex Waibel and Sebastian Stüker, members of the faculty at Karlsruhe Institute of Technology (KIT), Kites has “56 years of collective AI & Speech Processing R&D experience” under its belt, according to its website. Its unique value proposition (UVP) is “real-time” speech translation.

It is this focus that apparently motivated the purchase. Zoom said that Kites’ team of 12 research scientists will work with the company’s engineers to “advance the field of MT” and provide Zoom users with “multi-language translation.” In short, Zoom users will be able to enjoy multilingual in-app speech translation in the future.

The same blog post said that Stüker and team will continue to work out of Karlsruhe “where Zoom looks forward to investing in growing the team.” The company added that it is “exploring opening an R&D center in Germany in the future.” Waibel, meanwhile, will be “a Zoom Research Fellow, a role in which he will advise on Zoom’s MT research and development.”

At this writing, Kites’ focus seems to be speech-to-text. The startup published the paper, “Super-Human Performance in Online Low-latency Recognition of Conversational Speech,” in October 2020, and it appears to excel in the speech-to-text component, adding the translation step thereafter.

This approach has, more recently, been rivalled by direct speech-to-speech translation (S2ST), where some big tech companies such as Google have attempted to bypass the interim text-translation step (SlatorPro).

Current state-of-the-art S2ST is still laden with issues, including limits around latency and domain coverage (i.e., it is only usable for basic topics), but this may change if more resource-rich giants like Zoom get behind it.

As for Kites, it has flown under the radar as far as language industry startups go, which is not unusual in the German startup scene. But its UVP squares neatly with Zoom’s, and the acquisition signals a significant push into true speech-to-speech translation and, more broadly, automated interpreting.

While this may pose no immediate threat to well-funded remote simultaneous interpreting (RSI) startups, such as KUDO, Interprefy, or Boostlingo (the last of which is even part of the Zoom App Marketplace), the Zoom-Kites deal is still a major step in the longstanding competition posed by automation.