Despite the headlines, the universal translator for conversational speech has not arrived just yet. Microsoft, whose machine translation technology powers Skype Translator — which probably comes closest — admits as much.
In a paper, Microsoft said it observed the “clear negative impact” of inserting Skype Translator into a conversation. Researchers found that people spoke more slowly, used a “restricted vocabulary,” and would often “need to ask clarification questions when results are not understandable.”
That is still a far cry from Microsoft’s goal, which is to translate a natural conversation between speakers of two different languages so that “one would not be able to tell the difference between conversations held in one language and those held in two.”
No Free Lunch
In a bid to have clients and partners help accelerate progress, Microsoft released a large, 2GB Speech Language Translation Corpus so users of the Microsoft Translator Speech API have a baseline “to evaluate end-to-end conversational speech translation quality.”
Applications for the API range from making large repositories of audio files searchable by transcribing them into text, to real-time subtitling and machine translation of those subtitles, to one-to-one live translation, in person or remote (coming full circle back to the universal translator).
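As a concrete illustration of how such a speech-translation request is parameterized, the sketch below builds the kind of WebSocket URL the Translator Speech API used at the time. The endpoint and parameter names (`api-version`, `from`, `to`, `features`) follow Microsoft’s public documentation of that era and should be treated as assumptions rather than a guaranteed contract; authentication (a subscription key) and the streaming audio connection itself are omitted.

```python
from urllib.parse import urlencode

# Endpoint per Microsoft's 2016-era Translator Speech API docs (an assumption
# here); auth headers and audio streaming are omitted from this sketch.
BASE = "wss://dev.microsofttranslator.com/speech/translate"

def translation_url(source_lang: str, target_lang: str,
                    partial_results: bool = True) -> str:
    """Build the connection URL for one speech-translation session."""
    params = {
        "api-version": "1.0",
        "from": source_lang,   # spoken language of the audio, e.g. "en-US"
        "to": target_lang,     # language of the translated text, e.g. "fr-FR"
    }
    if partial_results:
        params["features"] = "partial"  # stream interim transcriptions
    return f"{BASE}?{urlencode(params)}"

url = translation_url("en-US", "de-DE")
print(url)
```

The `from`/`to` pairing is what distinguishes, say, a German-to-English subtitling job from a French-to-English live call.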
Users of the technology include Lionbridge (automatic subtitling), telecom provider Tele2 (live translation of phone conversations), and ProDeaf (multilingual support of speech-to-sign scenarios). Microsoft wants the corpus to become the “gold standard…for speech language translation.”
Microsoft does not provide the corpus purely for the sake of the greater good, of course. Using the Speech Translation API to transcribe and translate 1,000 hours of audio per month costs USD 7,000; 10,000 hours leaves you with a USD 35,000 monthly bill.
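Those two price points imply a steep volume discount. A quick back-of-the-envelope check, using only the two figures quoted above (the tier boundaries in between are not published here):

```python
# Effective per-hour rates implied by the two quoted price points.
# (Only these two data points are given; intermediate tiers are unknown.)

def per_hour_rate(monthly_bill_usd: float, hours_per_month: float) -> float:
    """Effective cost per audio hour at a given usage level."""
    return monthly_bill_usd / hours_per_month

rate_1k = per_hour_rate(7_000, 1_000)     # USD 7.00/hour at 1,000 hours/month
rate_10k = per_hour_rate(35_000, 10_000)  # USD 3.50/hour at 10,000 hours/month

print(rate_1k, rate_10k)  # 7.0 3.5
```

In other words, at ten times the volume, the effective hourly rate is cut in half.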
The corpus was created from actual conversations over Skype to “capture the typical side-effects of Skype’s transport layer.” It contains around 3,000 end-to-end speech translation sets for English, and 2,100 for French and German.
Each set consists of an audio file, a verbatim transcription, a cleaned-up transcription, and a translation based on the cleaned-up transcription. The average length of the audio sequence is 4.7 seconds in English, 5.4 seconds in French, and 6.7 seconds in German.
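One such set could be modeled as follows. The field names and the sample entry are purely illustrative (Microsoft’s actual file naming and layout are not specified here); the structure simply mirrors the four components described above.

```python
from dataclasses import dataclass

@dataclass
class TranslationSet:
    """One end-to-end speech translation set, mirroring the four components
    described for the corpus. Field names are illustrative, not Microsoft's."""
    audio_path: str           # the recorded audio snippet
    verbatim_transcript: str  # exact transcription, disfluencies included
    clean_transcript: str     # cleaned-up transcription
    translation: str          # translation of the cleaned-up transcription

# Made-up illustrative entry (not taken from the corpus).
sample = TranslationSet(
    audio_path="session_042_de.wav",
    verbatim_transcript="und ich meine auf wechat gibt es ja uh immer updates",
    clean_transcript="Und ich meine, auf WeChat gibt es immer Updates.",
    translation="And I mean, on WeChat there are always updates.",
)
```

Note that the translation is produced from the cleaned-up transcription, not the verbatim one, so evaluation scores are not penalized for disfluencies.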
The nature of the content is conversational (e.g., “And I mean on WeChat you always have updates of new emoticons that you can download”). The audio was transcribed and translated by human linguists. Microsoft recorded 100 speakers for each language with 50-plus pairings.
To simulate the eventual use case (i.e., two people speaking over Skype in two different languages), Microsoft asked bilingual participants to hold a 30-minute conversation in which one spoke German or French and the other responded in English.
In its blog post, Microsoft said it plans to release an updated version of Skype Translator in 2017 and expand language coverage.