Apple Scientists Go on a ‘Quest Toward Overcoming’ Speech Translation


After Google and Amazon, Apple has turned the page on the broad issue of speech translation. Or at least — to borrow the words of the authors of a recent paper — taken stock of where we are.

Published on the preprint server arXiv on April 17, 2020, “Speech Translation and the End-to-End Promise: Taking Stock of Where We Are” was authored by two research scientists from Apple.

Matthias Sperber is a Siri Machine Translation R&D Scientist based in the German spa city of Aachen. Matthias Paulik is a Senior Manager out of Cupertino HQ.

The two Matthiases got their PhDs from Karlsruhe Institute of Technology (KIT) a decade apart. Both served on the KIT research staff, with one focusing on automatic speech recognition (ASR), machine translation, and neural networks (Paulik), and the other, linguistic annotation, ASR, and speech-to-text (Sperber).

In their recent paper, Sperber and Paulik surveyed three decades’ worth of research into speech translation, defining its challenges, techniques, and requirements to “encourage meaningful and generalizable comparisons on our quest toward overcoming the long-standing issues found in ST models.” As the authors put it, “Given the abundance of prior work, a clear picture on where we currently stand is needed.”

As defined by the duo, speech translation (ST) is “the task of translating acoustic speech signals into text in a foreign language.” And although ST, put simply, has to do with generating accurate text output from speech input, the journey to get there is complex and multifaceted as it builds on previous work in automatic speech recognition (ASR) and machine translation (MT), the authors pointed out.

Taken in the context of Google and Amazon’s prior work (as well as Microsoft’s 2019 hologram demo), the brass ring in all this is, of course, (accurate) speech-to-speech translation.

Crucially, the authors point out that the only feasible approach, until recently, has been “the cascaded approach that applies an ASR to the speech inputs, and then passes the results on to an MT system.”

They note that there has since been progress in ST on two fronts: “general improvements in ASR and MT models, and moving from the loosely-coupled cascade in its most basic form toward a tighter coupling” (more under Chronological Survey below).

Sperber and Paulik qualify that “a large share of the progress has arguably been owed simply to general ASR and MT improvements” but that “recently, new modeling techniques and in particular end-to-end trainable encoder-decoder models have fueled hope for addressing challenges of ST in a more principled manner.”

They go on to say, however, that “despite these hopes, the empirical evidence indicates that the success of such efforts has so far been mixed,” which is why their study sets out to uncover the potential reasons behind this.

Sperber and Paulik’s paper does three things. First, it analyzes the historical development of speech translation more broadly. Next, it carves out the challenges related to ST, pointing out that research has so far been insufficient in analyzing these challenges. Finally, it highlights open research questions that can hopefully be addressed in future studies.

Chronological Survey

The paper begins with a chronological survey of more than 30 years’ worth of ST research, introducing key concepts. For instance, it cites two early papers from 1988 and 1991 to define “the loosely coupled cascade,” where researchers used separately built ASR and MT systems and then used “the best hypothesis of the former […] as input to the latter.”
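
To make the idea concrete, here is a minimal sketch of a loosely coupled cascade. It is not code from the paper: the functions toy_asr and toy_mt, the utterance ID "utt1", and the tiny English-German lexicon are all made-up stand-ins for separately built ASR and MT systems. The point is only that the MT step sees nothing but the single best ASR hypothesis.

```python
from typing import Callable, List, Tuple

# Hypothetical type: (transcript, score), where a higher score is better.
Hypothesis = Tuple[str, float]


def toy_asr(audio_id: str) -> List[Hypothesis]:
    """Stand-in ASR: return a fixed n-best list for a made-up utterance ID."""
    n_best = {
        "utt1": [
            ("i scream for ice cream", 0.6),
            ("ice cream for ice cream", 0.4),
        ],
    }
    return n_best[audio_id]


def toy_mt(source_text: str) -> str:
    """Stand-in MT: word-by-word lookup in a tiny made-up lexicon."""
    lexicon = {
        "i": "ich",
        "scream": "schreie",
        "for": "nach",
        "ice": "eis",
        "cream": "creme",
    }
    return " ".join(lexicon.get(word, word) for word in source_text.split())


def loosely_coupled_cascade(
    audio_id: str,
    asr: Callable[[str], List[Hypothesis]],
    mt: Callable[[str], str],
) -> str:
    """Run ASR, keep only its best hypothesis, and pass that text to MT."""
    hypotheses = asr(audio_id)
    best_transcript, _ = max(hypotheses, key=lambda h: h[1])
    return mt(best_transcript)


print(loosely_coupled_cascade("utt1", toy_asr, toy_mt))
```

Because everything except the top transcript is discarded before translation, any mistake in that transcript is passed straight on to the MT step.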

According to the authors, such early systems were prone to errors “propagated from the ASR, given the widespread use of interlingua-based MT which relied on parsers unable to handle mal-formed inputs.”

They added that subsequent systems, which relied on data-driven, statistical MT, “somewhat alleviated the issue, and also in part opened the path towards tighter integration.”

Also noteworthy: Sperber and Paulik point out that “the possibility of speech-to-speech translation, which extends the cascade by appending a text-to-speech component, was also considered early on (Waibel et al., 1991).”


The paper then defines “the central challenges, techniques, and requirements, motivated by the observation that recent work does not sufficiently analyze these challenges.”

Some of these central challenges arise from the aforementioned loosely-coupled cascade (e.g., error propagation, mismatched source-language, information loss). Sperber and Paulik then list typical countermeasures for each challenge.

In the case of mismatched source-language, for example — which is caused by (a) modeling assumptions, such as ASR only modeling unpunctuated transcripts and (b) mismatched training data, which leads to “stylistic and topical divergence” — typical countermeasures based on previous studies would be “domain adaptation techniques, disfluency removal, text normalization, and segmentation/punctuation insertion.”
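
As a rough illustration of what two of these countermeasures might look like in their simplest form, the sketch below removes disfluencies and inserts basic punctuation before a transcript is handed to MT. It is an assumption-laden toy, not a technique taken from the paper: the filler list, the function names, and the rule of collapsing immediately repeated words are all hypothetical, and real systems typically use trained models for these steps.

```python
# Assumed list of filler words; a real system would learn these from data.
FILLERS = {"uh", "um", "er", "hmm"}


def remove_disfluencies(transcript: str) -> str:
    """Drop filler words and collapse immediately repeated words."""
    cleaned = []
    for word in transcript.split():
        if word in FILLERS:
            continue
        if cleaned and cleaned[-1] == word:
            # Crude rule: this also removes legitimate repetitions.
            continue
        cleaned.append(word)
    return " ".join(cleaned)


def normalize_for_mt(transcript: str) -> str:
    """Lowercase, clean up, then add sentence casing and a final period."""
    text = remove_disfluencies(transcript.lower())
    if not text:
        return text
    text = text[0].upper() + text[1:]
    if text[-1] not in ".?!":
        text += "."
    return text


print(normalize_for_mt("um i i want uh two tickets"))
```

The aim is simply to make unpunctuated, spontaneous ASR output look more like the written text that MT systems are usually trained on.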

Open Research Questions

In conclusion, Sperber and Paulik suggest possible starting points for future research.

They note, for instance, that “while early decisions and data efficiency have been recognized as central issues, empirical insights are still limited and further analysis is needed. Mismatched source-language and information loss are often not explicitly analyzed.”

Moreover, wrote the authors, “We conjecture that the apparent trade-off between data efficiency and modeling power may explain the mixed success in outperforming the loosely coupled cascade. In order to make progress in this regard, the involved issues (early decisions, mismatched source-language, information loss, data efficiency) need to be precisely analyzed, and more model variants should be explored.”

As for traditional models, they suggest extending rather than altering them by, for example, “applying end-to-end training as a fine-tuning step, employing a direct model for rescoring, or adding a triangle connection to a loosely coupled cascade.”
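
One of these suggestions, using a direct model for rescoring, can be pictured roughly as follows. This is a hypothetical sketch under stated assumptions, not the authors' method: the candidate translations, their scores, the direct_scores table, and the interpolation weight are all invented for illustration. The cascade proposes several translations, and a second score from a direct speech-to-text model is blended in before the final choice is made.

```python
from typing import Callable, List, Tuple


def rescore(
    candidates: List[Tuple[str, float]],
    direct_score: Callable[[str], float],
    weight: float = 0.5,
) -> str:
    """Pick the candidate with the best blend of cascade and direct scores.

    candidates: (translation, cascade_score) pairs; higher is better.
    direct_score: hypothetical scorer standing in for a direct ST model.
    weight: share of the direct model's score in the blend (0 to 1).
    """

    def combined(item: Tuple[str, float]) -> float:
        translation, cascade_score = item
        return (1 - weight) * cascade_score + weight * direct_score(translation)

    return max(candidates, key=combined)[0]


# Made-up log-probability-style scores for two candidate translations.
candidates = [
    ("eis creme nach eis creme", -1.0),
    ("ich schreie nach eis creme", -1.2),
]
direct_scores = {
    "eis creme nach eis creme": -3.0,
    "ich schreie nach eis creme": -0.5,
}

print(rescore(candidates, direct_scores.__getitem__))
```

With a weight of 0 the cascade's own ranking is kept unchanged, so the traditional model is extended rather than replaced.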