BBC News Labs Runs Extensive Machine Translation and Automatic Transcription Test

Headquartered in London, BBC News broadcasts in 43 languages — a daily operation that requires much effort behind the scenes, typically by the same journalists who research and report the news. Indeed, the journalists do not see themselves as translators, and relate to their source and target material differently than language professionals might.

“Any assessment of transcription/translation solutions needs to reflect journalists’ requirements from the output, as they are the intended end-users of our solutions,” BBC News Labs Senior Research and Development Producer Sevi Sariisik Tokalac explained in a November 28, 2023 write-up.

In other words, beyond “good enough” or “fast enough,” the tech has to prove that it will work in a fast-paced newsroom. 

More specifically, News Labs investigated whether post-editing machine translation (MT) and automated transcription output would be more practical and efficient than translating and transcribing without that assistance.

Impressive Leap

Sariisik Tokalac said that News Labs has experimented with various transcription and MT models for more than a decade. Over that time, the group has observed an “impressive leap in quality and the number of languages covered by commercial and research-led models.” 

News Labs devised an experiment to identify the best-performing models for MT and transcription for work in Arabic, French, Brazilian Portuguese, and Spanish (paired with English for the relevant tasks). The BBC has large teams and multiple outlets for each of these languages, and language models typically perform very well in these high-resource languages.

Evaluators, nominated by their teams for their language skills and editorial judgment, checked, corrected, and evaluated about 45,000 words spread across three tasks: non-English-language transcription; translation from English; and translation into English. The content was distributed evenly across genres such as politics, health, science, economics, and societal issues, all standard fare in regular BBC programming.

Researchers pushed content through a number of transcription and translation systems, among them household-name offerings from AWS, DeepL, Deepgram, Google, Microsoft Azure, and Speechmatics, as well as OpenAI’s open-source Whisper.
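
Whisper, the one open-source system on that list, can be run locally. As a rough illustration (the file name, language, and model size below are invented, not details from the BBC write-up), a scripted transcription pass might look like this:

```python
# Minimal transcription sketch using OpenAI's open-source Whisper package
# (pip install openai-whisper). File name, language, and model size are
# illustrative assumptions, not BBC News Labs' actual configuration.
import whisper

model = whisper.load_model("medium")              # multilingual checkpoint
result = model.transcribe("bulletin_ar.mp3",      # hypothetical Arabic clip
                          language="ar")          # skip language auto-detection
print(result["text"])                             # raw transcript for post-editing
```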

Offline, evaluators timed themselves and tracked their corrections, assigning each sample a quality score on a scale of 0-100. 
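
As one minimal sketch of that bookkeeping (the field names and figures are hypothetical, not the BBC’s schema), revision time can be normalized to minutes per 1,000 words so samples of different lengths remain comparable:

```python
# Hypothetical per-sample evaluation record; normalizing revision time to
# minutes per 1,000 words makes samples of different lengths comparable.
from dataclasses import dataclass

@dataclass
class SampleEvaluation:
    model: str
    word_count: int
    revision_minutes: float
    quality_score: int  # evaluator's 0-100 judgment

    def minutes_per_1000_words(self) -> float:
        return self.revision_minutes / self.word_count * 1000

sample = SampleEvaluation("model_a", 820, 14.5, 78)
print(f"{sample.minutes_per_1000_words():.1f} min per 1,000 words")  # 17.7
```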

“The scoring brief to the evaluators was not to seek perfection, but consider what they might reasonably expect from a sharp, fresh graduate starting a work placement: a body of text with no major errors, but some minor ones, and might need stylistically refining to align with the BBC’s content,” Sariisik Tokalac wrote.

On a second go-around, evaluators categorized and color-coded their corrections as major errors (which could impact meaning); minor errors (which need correcting to be usable); or stylistic enhancements.
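
In code, that three-way taxonomy might be tallied along these lines (the category labels mirror the description above; the corrections themselves are invented):

```python
# Sketch of the major / minor / stylistic taxonomy described above.
from collections import Counter
from enum import Enum

class Correction(Enum):
    MAJOR = "major"        # could impact meaning
    MINOR = "minor"        # needs correcting to be usable
    STYLE = "stylistic"    # enhancement to align with BBC style

# One evaluator's color-coded corrections for a single sample (made up):
corrections = [Correction.MAJOR, Correction.MINOR, Correction.MINOR,
               Correction.STYLE, Correction.STYLE, Correction.STYLE]
print(Counter(c.value for c in corrections))
# -> Counter({'stylistic': 3, 'minor': 2, 'major': 1})
```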

Best in Show

BBC News Labs ultimately ranked the models by quality score and by time required to revise 1,000 words, prioritizing those that produced the fewest “major errors.” The group’s final shortlist comprises the top two models for each language, which Sariisik Tokalac declined to name publicly. More evaluators will now be pulled into the mix, assessing additional samples and providing further feedback on the two shortlisted models for their language.
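
One simple way to realize such a composite ranking, assuming per-model averages for score, revision time, and major-error count (all figures below are invented for illustration):

```python
# Hedged ranking sketch: fewest major errors first, then highest quality
# score, then least revision time. Weighting and data are assumptions only.
results = [
    # (model, mean quality score, mean minutes per 1,000 words, major errors)
    ("model_a", 82.0, 18.5, 3),
    ("model_b", 79.5, 15.0, 1),
    ("model_c", 74.0, 22.0, 7),
]

shortlist = sorted(results, key=lambda r: (r[3], -r[1], r[2]))[:2]
for model, score, minutes, majors in shortlist:
    print(f"{model}: score {score}, {minutes} min/1k words, {majors} major errors")
```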

“We may apply a more detailed categorisation of the error types and introduce a method of automating the calculations,” she added, pointing out that while human rankings correlated strongly with those produced by automatic metrics such as BLEU, TER, and COMET, there were discrepancies between individual evaluators.
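
Agreement between human judgments and an automatic metric is commonly quantified with a rank correlation. A sketch using SciPy and made-up per-sample scores (not BBC data):

```python
# Spearman rank correlation between human 0-100 scores and per-sample values
# from an automatic metric (e.g., COMET). All numbers are illustrative.
from scipy.stats import spearmanr

human_scores  = [78, 85, 62, 90, 71]
metric_scores = [0.71, 0.80, 0.55, 0.88, 0.64]

rho, p_value = spearmanr(human_scores, metric_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```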

The time savings are also considerable: MT, especially for “languages of proximity,” cuts delivery time by about one-third. For transcription, the benefit is even more pronounced, with automated transcription followed by human correction roughly four times faster than manual transcription.
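
To make those figures concrete, a back-of-the-envelope calculation against assumed baselines (neither number comes from the BBC write-up):

```python
# Assumed baselines for illustration only: 60 minutes to translate 1,000
# words manually, 40 minutes to transcribe a clip manually.
manual_translation_min = 60
post_edited_mt_min = manual_translation_min * (1 - 1/3)   # ~1/3 time saved
print(f"MT post-editing: ~{post_edited_mt_min:.0f} min vs {manual_translation_min} min")

manual_transcription_min = 40
corrected_auto_min = manual_transcription_min / 4         # ~4x faster
print(f"Corrected auto transcription: ~{corrected_auto_min:.0f} min vs {manual_transcription_min} min")
```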

Moreover, News Labs believes the experiment has demonstrated that these models can be used in “genuine workflow scenarios,” which might include, in the future, “translating World Service news articles in various languages into English, [linking and clustering] World Service language teams’ articles to measure the scale of a particular story’s impact across multiple languages.”