Researchers Propose First ‘Transcription-Free’ Automatic Subtitling Model

Eliminating Transcript Dependency in Automatic Subtitling

In a May 17, 2024 paper, Marco Gaido, Sara Papi, Matteo Negri, Mauro Cettolo, and Luisa Bentivogli from the Fondazione Bruno Kessler introduced a new approach to automatic subtitling (AS) that eliminates the reliance on intermediate transcripts for timestamp prediction.

The researchers explained that subtitles consist of text blocks with corresponding time durations to ensure synchronization with the video, enhancing viewers’ experience. Automating subtitling involves three main tasks: translating spoken content, segmenting the translated text, and estimating timestamps for each segment.
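To make the "text block plus time span" idea concrete, here is a minimal sketch of how one subtitle block might be represented and rendered in the widely used SubRip (SRT) layout. The `SubtitleBlock` class and the helper names are hypothetical and chosen only for illustration; they are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubtitleBlock:
    """One subtitle: its text lines plus the time span it stays on screen."""
    index: int        # 1-based position of the block in the subtitle file
    start: float      # onset in seconds
    end: float        # offset in seconds
    lines: List[str]  # the segmented text, one entry per on-screen line


def format_timestamp(seconds: float) -> str:
    """Render seconds as the HH:MM:SS,mmm form used by SRT files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(block: SubtitleBlock) -> str:
    """Serialize a single block: index, time range, then the text lines."""
    time_range = f"{format_timestamp(block.start)} --> {format_timestamp(block.end)}"
    return f"{block.index}\n{time_range}\n" + "\n".join(block.lines) + "\n"
```

In this picture, translation and segmentation fill in `lines`, while timestamp estimation fills in `start` and `end`.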

Early subtitling systems used a cascade architecture with automatic speech recognition (ASR) and machine translation (MT), relying heavily on transcripts for all tasks. However, this approach has limitations such as error propagation, loss of useful prosodic information, inapplicability to languages without written forms, and increased computational and environmental costs.

In response to these limitations, recent research has shifted towards transcription-free solutions for translation and segmentation by using direct speech-to-text translation systems and adapting MT and language models for subtitle segmentation. 

While translation and segmentation have received attention, the direct generation of timestamps has received “much less attention,” according to the researchers. Current approaches still rely on transcripts for timestamp estimation: they generate captions, estimate timestamps for those captions, and then project the timestamps onto the target subtitles.

To that end, they presented a model that completely eliminates the need for intermediate transcripts, even for timestamp prediction. According to the researchers, this is “the first fully end-to-end AS solution that seamlessly produces both subtitles (i.e., segmented translations) and their timestamps without any reliance on intermediate transcripts.”

The researchers proposed two main methods for timestamp estimation: one that uses the Connectionist Temporal Classification (CTC) loss to align audio directly with the translated subtitles, and another that estimates the temporal alignment between the audio and subtitles using the attention mechanism. (CTC is an algorithm often used in speech recognition to handle situations where there is no explicit information about alignment between the input and output.)
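As a rough illustration of the attention-based idea, the sketch below reads subtitle boundaries off a matrix of attention weights between output tokens and audio frames. It is a simplified toy under stated assumptions, not the authors' algorithm: the function name `estimate_block_times`, the argmax rule, and the default frame duration are all hypothetical.

```python
import numpy as np

FRAME_SECONDS = 0.04  # assumed duration of one audio frame; purely illustrative


def estimate_block_times(attention, block_end_tokens, frame_seconds=FRAME_SECONDS):
    """Estimate (start, end) times in seconds for each subtitle block.

    attention: array of shape (num_tokens, num_frames) holding non-negative
        weights between each output token and each audio frame.
    block_end_tokens: index of the last token of every block, in ascending order.
    """
    attention = np.asarray(attention, dtype=float)
    num_frames = attention.shape[1]
    # For every token, the audio frame it attends to most strongly.
    peak_frames = attention.argmax(axis=1)

    times = []
    start_frame = 0
    for token in block_end_tokens:
        # A block ends just after the frame its final token attends to most.
        end_frame = int(peak_frames[token]) + 1
        # Keep boundaries non-decreasing and within the length of the audio.
        end_frame = min(max(end_frame, start_frame + 1), num_frames)
        times.append((start_frame * frame_seconds, end_frame * frame_seconds))
        start_frame = end_frame  # the next block starts where this one ends
    return times
```

In this toy version each block starts exactly where the previous one ends, so pauses between subtitles are not modeled; a real system would need to handle silence and noisy attention patterns.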

They also introduced SubSONAR, a new metric designed to evaluate timestamp quality. They explained that “current metrics are by design holistic and therefore inadequate to precisely measure timestamp estimation quality.”

Unlike other metrics, SubSONAR is specifically sensitive to time shifts, thus enabling a focused assessment of timestamp accuracy. 

SOTA Results

The researchers validated the effectiveness of the proposed model through extensive experiments on seven language pairs, under two data conditions, and across four different domains, using both automatic and human evaluations. The new model achieved “state-of-the-art results,” outperforming existing cascade architectures in automatic subtitling.

Specifically, manual evaluations indicated that the need for timestamp adjustments fell by approximately 24% compared to previous methods. The researchers noted that the human evaluation was “the first-ever manual evaluation of timestamp quality in subtitling.”

Their analysis also showed that timestamp errors were not only less frequent but also less severe than with other methods. “Our experiments evidence that our proposed solution can effectively close the gap between cascade and direct subtitling systems for the first time,” they said.

The code and pre-trained models are available on GitHub, facilitating further research and practical applications in the field.