Facebook Takes a Stab at True Speech-to-Speech Translation via Textless NLP

Facebook is striking out on its own path toward true speech-to-speech translation (S2ST): removing the text translation step from the very beginning and working with audio alone.

As previously mentioned, a common S2ST cascade comprises automatic speech recognition (ASR), transcription, text machine translation (MT), and text-to-speech synthesis into the target language.
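To make the contrast concrete, the cascade can be sketched as three chained stages, set against a direct speech-to-speech call. This is only an illustration: the `Cascade` class, the `run_direct` helper, and the toy stage functions are hypothetical stand-ins and do not reflect Facebook's, Google's, or Apple's actual code.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: audio is modeled as a plain list of samples for illustration only.
Audio = List[float]


@dataclass
class Cascade:
    """Hypothetical S2ST cascade: every stage hands text to the next one."""
    asr: Callable[[Audio], str]   # speech recognition / transcription
    mt: Callable[[str], str]      # text-to-text machine translation
    tts: Callable[[str], Audio]   # text-to-speech synthesis in the target language

    def translate(self, speech: Audio) -> Audio:
        transcript = self.asr(speech)     # audio -> source-language text
        translated = self.mt(transcript)  # source text -> target text
        return self.tts(translated)       # target text -> audio


def run_direct(model: Callable[[Audio], Audio], speech: Audio) -> Audio:
    """Hypothetical direct S2ST: audio in, audio out, no text in between."""
    return model(speech)


# Toy stand-ins so the sketch runs; real systems would plug in trained models.
def toy_asr(speech: Audio) -> str:
    return "hello" if speech else ""


def toy_mt(text: str) -> str:
    return {"hello": "bonjour"}.get(text, text)


def toy_tts(text: str) -> Audio:
    return [float(ord(ch)) for ch in text]


cascade = Cascade(asr=toy_asr, mt=toy_mt, tts=toy_tts)
cascade_output = cascade.translate([0.1, 0.2, 0.3])
direct_output = run_direct(lambda speech: list(reversed(speech)), [0.1, 0.2, 0.3])
```

In the cascade, any error or missing text resource in the ASR or MT stage carries through to the final audio, which is the dependency the textless approach tries to remove.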

While Google has been working to eliminate the text-MT step via Translatotron, a project the search giant first unveiled (SlatorPro) back in the spring of 2019, Apple has been on a similar quest, albeit via a more introspective route.

As Apple scientists pointed out, the cascade using the text-MT step was the only feasible approach until recently, with the lion’s share of progress attributable to improvements in (and hindered by the limits of) ASR and MT, two fields with plenty of expertise globally.

Of course, these improvements are nothing to scoff at. As Jochen Hummel pointed out at last week’s SlatorCon Remote, GPT-3 is definitely a big deal.

In a September 9, 2021 blog post, however, scientists at Facebook AI noted what they called “an important limitation” of prior improvements; that is, they are “mainly restricted to languages with very large text [datasets] suitable for training AI models.”

To be fair, Facebook scientists acknowledged that GPT-3 (as well as BERT, etc.) did indeed make “huge strides” and could be fine-tuned to apply to a variety of complex natural language processing (NLP) use cases. But all this prior tech still depends on text, and Facebook thinks its new model “breaks free” of that.

The social media giant calls its new model Generative Spoken Language Model (or GSLM) and said it “leverages recent breakthroughs in representation learning, allowing it to work directly from only raw audio signals, without any labels or text.”

According to a September 7, 2021 paper, Facebook AI trained its model on 6,000 hours of audio from one of two English datasets taken from audiobooks.

As the scientists so eloquently blogged, their model “opens the door to a new era of textless NLP applications for potentially every language spoken on Earth — even those without significant text [datasets].”

In short, GSLM aims to render ASR obsolete by working in true end-to-end fashion, from speech input to speech output, Facebook said.

A Textless NLP Future

As only big tech can, Facebook gathered a multidisciplinary team of researchers to work on GSLM: experts in psycholinguistics, signal processing, speech processing, and NLP.

They likened their approach to how preschool children learn language “solely from raw sensory inputs and audio interactions” (hence the psycholinguistics, etc.), using this as a template for their new textless NLP model.

Facebook highlighted the importance of the textless NLP approach, summarized as follows:

  • It can be applied to training models for any spoken language.
  • Models can incorporate nuances and intonations in speech that denote emotions (e.g., anger, irony, uncertainty) and even “vocalizations” (laughter, yawning, etc.).
  • It can be used to train models on audio-first experiences (e.g., podcasts), bypassing annotation or training ASR.

Early feedback on the Facebook model was generally enthusiastic at this writing, with one concerned tweet over the eventual fate of ASR, and another balking at the supposed hype.

The social media giant foresees a host of applications for its new model — multilingual video games, content search, summarizing archived audio — and said “textless NLP technology should make AI more inclusive and able to model a richer variety of languages than is possible today.”