Here’s Europe’s Latest Attempt at End-to-End Speech Translation

If there were ever a shortlist of projects that had the potential to produce a Babel Fish-type translation device, this would probably be on it.

Backed by the European academe, private sector, and government, the project is called ELITR (pronounced “eh-lee-ter”), also known as European Live Translator. The project was born out of the need to provide subtitles for a EUROSAI Congress back in May.

EUROSAI is the European Organization of Supreme Audit Institutions. The Supreme Audit Office of the Czech Republic initiated the project to help translate speeches in real time from six source languages into 43 targets: 24 EU languages, plus 19 EUROSAI languages (e.g., Armenian, Russian, Bosnian, Georgian, Hebrew, Kazakh, Norwegian, Luxembourgish).

In an ELITR demo video, Charles University Associate Professor Ondřej Bojar said the project also looks into the possibility of “going directly from the source speech into the target language with an end-to-end spoken language translation system.”

In short, speech-to-speech translation (S2ST). For ELITR, however, Bojar told Slator, “We stop at the target text. We are not including the final text-to-speech — although we definitely could.”

S2ST has become a sort of brass ring in research and big tech, pursued by the likes of Apple, Google (via the so-called “Translatotron”; SlatorPro), and prominent Japanese researchers, who uploaded a toolkit for it on GitHub. Chinese search giant Baidu even drew some flak for claims around it, and, of course, there is a whole graveyard of translation gadgets from companies that tried to commercialize S2ST.

Admittedly, ELITR’s production pipeline currently relies on two independent steps: automatic speech recognition (ASR) and machine translation (MT). According to Bojar, “we are actually quite good in these two steps” (as evidenced by a paper published on June 17, 2021, and two others published in September and October 2020).
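To make the two-step idea concrete, here is a minimal sketch of a cascaded pipeline in Python. It is not ELITR’s code: `make_cascade`, `toy_asr`, `toy_mt`, and `TOY_LEXICON` are hypothetical names invented for illustration, and the toy functions merely stand in for real ASR and MT systems.

```python
from typing import Callable

def make_cascade(
    asr: Callable[[bytes], str],
    mt: Callable[[str, str], str],
) -> Callable[[bytes, str], str]:
    """Chain two independent steps: speech -> source text -> target text."""
    def translate_speech(audio: bytes, target_lang: str) -> str:
        source_text = asr(audio)              # step 1: speech recognition
        return mt(source_text, target_lang)   # step 2: text translation
    return translate_speech

# Hypothetical stand-in for ASR: pretends the audio bytes are the transcript.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")

# Hypothetical stand-in for MT: a one-entry lookup table.
TOY_LEXICON = {("good morning", "cs"): "dobré ráno"}

def toy_mt(text: str, target_lang: str) -> str:
    # Unknown phrases are passed through unchanged.
    return TOY_LEXICON.get((text.lower(), target_lang), text)

pipeline = make_cascade(toy_asr, toy_mt)
print(pipeline(b"Good morning", "cs"))
```

Because the second step only ever sees text, anything carried by the audio alone is lost at the hand-off between the two steps. An end-to-end system would replace both callables with a single model that maps audio directly to target text.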

End-to-end speech translation is part of the long-term vision, as outlined in a recent paper published on the Association for Computational Linguistics portal. “The goal of a practically usable simultaneous spoken language translation (SLT) system is getting closer,” wrote the authors from Charles University, Karlsruhe Institute of Technology, the University of Edinburgh, and Italy-based automatic speech recognition (ASR) provider PerVoice. SLT also encompasses off-line spoken language systems, the authors said.

The authors (Bojar, among them) mentioned two problems of the current system that have yet to be solved.

  • Intonation – which cannot be factored in as punctuation prediction has no access to sound; and
  • Segmentation errors – that is, MT systems tending to “normalize word order,” thus reducing fluency in a stream of spoken sentences.

Hence, “for the future, we consider three approaches,” Bojar et al. added: “(1) training MT on sentence chunks, (2) including sound input in punctuation prediction, or (3) end-to-end neural SLT.”

Working alongside Charles University on ELITR were the University of Edinburgh and Karlsruhe Institute of Technology. ASR provider PerVoice and Germany-based video conferencing platform alfaview also participated in the project. Does this mean commercialization plans are on the drawing board?

Bojar told Slator, “For a research institute at a university, commercialization is always something that takes an unbearably long time, but we are definitely very open to many forms of collaboration.”