TikTok Parent Unveils PolyVoice, Speech-to-Speech Translation with Language Models

ByteDance, TikTok’s parent company, is stepping up in the speech-to-speech translation (S2S) game with its newly proposed PolyVoice – a language model-based framework.

In a research paper published on June 13, 2023, the China-based tech company introduced a decoder-only model that enables direct translation, diverging from the traditional encoder-decoder approach that remains prevalent in speech modeling.

As noted in the Slator Interpreting Services and Technology Report, published in late 2022, research and development activity in S2S translation is booming. Meta has contributed to data collection through the release of a large-scale multilingual corpus. Rival tech giant Google has been active in technological development, as demonstrated by the release of its fully unsupervised Translatotron 3 model.

A key feature of PolyVoice is its ability to generate and use “discretized speech units”, which turn the continuous stream of spoken language into discrete, manageable fragments. Moreover, these units are generated in a fully unsupervised manner.

The process efficiently extracts the important information in the speech and represents it in small chunks called semantic units.
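
To make the idea of discretized units concrete, here is a minimal sketch. It does not reproduce PolyVoice's actual pipeline, which the article does not detail. It assumes a common recipe in unit-based speech research: each frame of continuous features is assigned to its nearest cluster centroid, and consecutive repeats are merged. The centroids, the 2-D features, and the function names are all hypothetical.

```python
from itertools import groupby

def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to one feature frame."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: sq_dist(frame, centroids[i]))

def discretize(frames, centroids):
    """Map each continuous frame to a unit ID, then merge consecutive repeats."""
    unit_ids = [nearest_centroid(frame, centroids) for frame in frames]
    return [unit_id for unit_id, _ in groupby(unit_ids)]

# Hypothetical 2-D "features"; real systems would use learned,
# high-dimensional features and centroids fitted without labels.
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.1), (1.0, 0.9), (0.1, 0.9), (0.1, 0.1)]

units = discretize(frames, centroids)
print(units)  # [0, 1, 2, 0]
```

Because the centroids in such a recipe can be learned from audio alone, no transcripts are needed, which is what makes the approach attractive for languages without a written form.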

This feature is particularly useful for languages with no writing system, since text-based approaches are usually inadequate for them.

The Two Pillars

PolyVoice integrates two language models: a translation language model and a speech synthesis language model. The first model is responsible for conveying the meaning of the source speech in the target language. The second language model, in turn, generates the target speech, making sure that the output mimics the voice and other characteristics of the source speaker.

The ability to clone the voice and speaking style of the original speech comes from an approach based on VALL-E X, Microsoft’s voice-cloning model, which is known for its ability to replicate the nuances of human speech.

The system merges the semantic units of the original and translated content with the audio elements of the source speech. This combined sequence is processed by an audio language model, which predicts how the translated speech should sound. Finally, these audio predictions are transformed into a playback-ready format, effectively synthesizing the translated speech.
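
The merging step described above can be pictured as simple sequence concatenation. The sketch below is illustrative only: the separator IDs, the ordering of the segments, and the function name are assumptions, not details taken from the paper.

```python
from typing import List, Sequence

# Hypothetical separator IDs; a real vocabulary would reserve its own.
SEP_UNITS = -1
SEP_AUDIO = -2

def build_audio_lm_prompt(
    src_units: Sequence[int],
    tgt_units: Sequence[int],
    src_audio_tokens: Sequence[int],
) -> List[int]:
    """Join source units, translated units, and source audio tokens into one prompt."""
    return [*src_units, SEP_UNITS, *tgt_units, SEP_AUDIO, *src_audio_tokens]

prompt = build_audio_lm_prompt([11, 12], [21, 22, 23], [901, 902])
print(prompt)  # [11, 12, -1, 21, 22, 23, -2, 901, 902]
```

An audio language model would then continue this prompt with audio tokens for the translated speech. The source audio tokens at the end are what would let it carry over the original speaker's voice.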

PolyVoice departs from the conventional two-step encoder-decoder model. Instead, it relies on a novel decoder-only approach, which makes it possible to translate the source speech into the target language without intermediate text representations. This is an attempt to streamline the translation process, which may lead to lower latency and more natural output.
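
As a rough illustration of what "decoder-only" means in practice, the sketch below shows a single model continuing a sequence that begins with the source units, one token at a time. The special token IDs and the `toy_next_token` stand-in are hypothetical. A trained language model would take the stand-in's place, and PolyVoice's real decoding strategy may differ.

```python
from typing import Callable, List, Sequence

BOS, SEP, EOS = 0, 1, 2  # hypothetical special token IDs

def translate_decoder_only(
    src_units: Sequence[int],
    next_token: Callable[[List[int]], int],
    max_len: int = 50,
) -> List[int]:
    """Greedy decoding: the model extends a sequence that starts with the source."""
    sequence = [BOS, *src_units, SEP]
    prefix_len = len(sequence)
    while len(sequence) - prefix_len < max_len:
        token = next_token(sequence)
        if token == EOS:
            break
        sequence.append(token)
    return sequence[prefix_len:]

def toy_next_token(sequence: List[int]) -> int:
    """Stand-in for a trained model: emits each source unit + 100, then EOS.

    Assumes source unit IDs are 3 or greater, so they never collide with SEP.
    """
    sep_at = sequence.index(SEP)
    src = sequence[1:sep_at]
    produced = len(sequence) - sep_at - 1
    return src[produced] + 100 if produced < len(src) else EOS

target_units = translate_decoder_only([5, 6, 7], toy_next_token)
print(target_units)  # [105, 106, 107]
```

The point of the sketch is that there is no separate encoder pass: source and target live in one sequence, and the same next-token loop handles both.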

Unwritten, No Problem

From the perspective of global communication, the most notable takeaway from PolyVoice is its ability to support unwritten languages. It could open up new communication opportunities for communities whose languages are predominantly oral.

Furthermore, the advanced audio language model of PolyVoice makes it possible to retain the original speaker’s voice and style, making the translations feel more natural and personal.

From a modeling standpoint, the innovative decoder-only model can make a lasting impact on the whole speech translation process by eliminating well-known problems associated with conventional modeling, such as error propagation, latency, and loss of paralinguistic information.