Good, Bad or Loose: Amazon’s New Subtitle Quality Estimation System

Amazon Prime Video Subtitle Localization and Translation

What are you doing to stay occupied during the lockdown? Baking, DIY, yoga, Xbox? For many, devouring new movies and TV programs is now more than just an enjoyable way to unwind in an evening — streaming content has become an essential mechanism for coping during the weeks and months spent confined at home during the lockdowns.

Streaming platforms such as Amazon Prime Video, Netflix, Hulu and others are well aware that delivering a good viewing experience is essential to keeping audiences engaged. For foreign-language content, the quality of subtitles can be the difference between viewers staying tuned or switching off.

Two researchers from Amazon Prime Video, International Expansion, Prabhakar Gupta and Anil Nelakanti, observed in a recent paper that “providers increase their viewership by enabling content in local languages.” At the same time, they said, “low translation quality can cause increased usage drop-off and hurt content viewership for audience of target language.” So getting the quality right is crucial. 

A new research paper by Amazon explores the topic of quality estimation (QE) in translated subtitles. The paper, entitled “DeepSubQE: Quality estimation for subtitle translations,” presents a new system for estimating the quality of translated subtitles, whether human- or machine-generated. The DeepSubQE system reduces both cost and time in subtitle translation while assuring quality, the researchers said.

“Low translation quality can cause increased usage drop-off and hurt content viewership for audience of target language” — Prabhakar Gupta and Anil Nelakanti, Amazon Prime Video, International Expansion

Subtitling and dubbing are indeed topical for Amazon. As the researchers explained, the “digital entertainment industry is growing multifold with ease of internet access and numerous options for on-demand streaming platforms.” And, although a separate paper published by Amazon in January 2020 looked into machine dubbing, Gupta and Nelakanti pointed out that “translation of subtitles across languages is a preferred cost effective industry practice to maximize content reach.”

Subtitling may be a more cost-efficient approach than dubbing, but the researchers also noted that using human translators to localize subtitles is expensive, and the “man-power cost” increases significantly in the case of low-resource target languages. Since a second person normally checks the translated subtitles to improve the quality of the translation where needed, quality evaluation is “as expensive as generating the translation itself,” they said.

Exploring an automated approach for estimating the quality of translated subtitles is therefore logical, and all the more so for Amazon, which is among the biggest players in the on-demand streaming space.

Good, Bad or Loose

Clearly, quality estimation for any kind of translation is challenging because there is more than one way to translate a given sentence into a specific target language. Legitimate translation techniques like paraphrasing, rephrasing and the use of idiom, which are frequently employed in subtitle translation, complicate matters further, since they confound binary methods of quality estimation.

Most QE methods are binary methods, however, and simply classify translations as “good” or “bad.” They fail to deal with “‘loosely’ translated samples that often occur due to human judgment.”

Under Amazon’s DeepSubQE model, which introduces a “loose” translation category, a good translation is one that retains “all meaning from [the] source and reads fluently.” A bad translation is one that bears no resemblance to the meaning of the original and is “disconnected from the context in the video.” A loose translation is one that uses paraphrasing, colloquialism or idiom, or “contains some contextual information not available in [the] source text.”

The researchers also noted how categorizing subtitle translations into the three categories helps to indicate what further work the translations require, if any. Good translations are fine as is and need no further improvement, while loose translations may require human post-edits and bad translations need a “complete rewrite,” they said.  
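The three-way triage described above can be sketched as a simple routing step. This is an illustrative sketch only, not code from the paper: the label names follow the paper's categories, but the `QELabel` enum and `triage` function are hypothetical.

```python
from enum import Enum


class QELabel(Enum):
    """Three-way quality labels, per the paper's category definitions."""
    GOOD = "good"    # retains all meaning from the source and reads fluently
    LOOSE = "loose"  # paraphrase, colloquialism, idiom, or extra context
    BAD = "bad"      # no resemblance to the source; disconnected from context


def triage(label: QELabel) -> str:
    """Map a predicted quality label to the next workflow step."""
    if label is QELabel.GOOD:
        return "publish"          # fine as is, no further improvement needed
    if label is QELabel.LOOSE:
        return "human post-edit"  # may require targeted human edits
    return "complete rewrite"     # bad translations are redone from scratch


print(triage(QELabel.LOOSE))  # human post-edit
```

The value of the third class is visible here: a binary system would have to lump “loose” samples in with either bucket, hiding exactly the translations that benefit most from light human post-editing.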

The researchers worked with 30,000 video subtitle files in English and their corresponding translations in French, German, Italian, Portuguese and Spanish for their experiments. They found that the DeepSubQE model was accurate in its estimations more than 91% of the time for all five languages. The system performed slightly better for longer sentences.

One area of improvement they noted for DeepSubQE was related to the operational load. Currently, the system relies on training one model per language pair, which requires “considerable operational load,” the researchers said. However, future work could explore using a multilingual model, which would help to reduce the load and also be of benefit to “resource starved languages,” they added.