How Good Are Subtitling Tools?

Subtitle Quality

New technologies behind subtitles are good enough to actually drive the demand for them, encouraging companies, for example, to explore new markets sooner than they might have otherwise. 

And while American media has long been subtitled for non-English-speaking fans around the world, international content, often accompanied by subtitles, enjoyed a surge in popularity in the US during Hollywood’s (recently ended) strike.

It is a virtuous cycle that finds language service providers (LSPs) experimenting with the level of automation within a workflow and the corresponding involvement of a human expert-in-the-loop.

The traditional subtitle creation workflow consists of two steps: transcription and translation. The primary technologies available to make those processes more efficient and less expensive are automatic speech recognition (ASR) and machine translation (MT). 
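The two-step workflow can be pictured as a small pipeline. The following is a minimal sketch, not any vendor's implementation: `transcribe` and `translate` are hypothetical stand-ins for whatever ASR and MT systems an LSP plugs in, and `Cue`, `build_subtitles`, and `to_srt` are names invented for illustration. The output uses the common SubRip (.srt) layout of a counter, a timing line, and the subtitle text.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Cue:
    start: float  # cue start time, in seconds
    end: float    # cue end time, in seconds
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def build_subtitles(
    audio: object,
    transcribe: Callable[[object], List[Cue]],  # hypothetical ASR step
    translate: Callable[[str], str],            # hypothetical MT step
) -> List[Cue]:
    """Step 1: transcription. Step 2: translation, keeping the timings."""
    source_cues = transcribe(audio)
    return [Cue(c.start, c.end, translate(c.text)) for c in source_cues]


def to_srt(cues: List[Cue]) -> str:
    """Serialize cues as SubRip text, with a blank line between cues."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        timing = f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}"
        blocks.append(f"{index}\n{timing}\n{cue.text}\n")
    return "\n".join(blocks)
```

Because both steps are passed in as plain functions, a human expert-in-the-loop could be modeled by wrapping either one with a review stage, without changing the rest of the pipeline.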

ASR technology processes human speech and converts it into text. With developments in acoustic modeling, NLP, and deep learning, ASR has already become more accurate across a wider range of languages, speaker types, and audio environments.

LSPs continue to explore workflows centered on ASR for multilingual subtitle creation. True Subtitles founder Mara Campbell wrote in 2019 that even the worst-performing ASR tools save human subtitlers time, and their quality has advanced significantly since then.

MT, meanwhile, typically offers high quality for high-resource language pairs and weaker performance for low-resource languages. 

This is also typically the case for large language models (LLMs), the machine learning models with the ability to process and “understand” human language. Most LLMs are generalists that perform a wide range of tasks, but they can perform well on specific tasks, such as translation, with fine-tuning and prompting.

The advent of LLMs is significant for MT in that they take into account a large amount of context (i.e., data) when they translate, resulting in more coherent and appropriate translations.

LLMs have their drawbacks, of course, including gender bias, difficulty capturing emotion, and nonsensical output.

But they also offer exciting possibilities: Multimodal models, for instance, might be able to consider accompanying visuals when deciding on a translation. Using visual prompts and images as context for MT has the potential to drastically change MT for subtitling.