AWS Presents a New Dataset for Multi-Modal Document Translation

In a June 12, 2024 paper, researchers from AWS AI Labs, the University of Maryland, and the Nara Institute of Science and Technology presented M3T, a multi-modal benchmark dataset designed to evaluate machine translation (MT) systems for translating visually rich, semi-structured documents. 

The researchers explained that most document-level MT systems focus on textual content at the sentence level and disregard visual cues such as paragraphs, headers, and the overall document structure, which are important for understanding the context and the relationships between different sections of text. “Visual cues represent an important yet overlooked set of features which can provide contextual clues,” they said.

They also noted that most MT systems assume perfect text extraction from documents. However, extracting text from documents using optical character recognition (OCR) often results in errors, especially in documents with complex layouts (e.g., multiple columns, tables).

This new dataset addresses these shortcomings by including complex text layouts, which are typical in real-world documents such as PDFs. The researchers emphasized that “M3T focuses on PDF documents which are a commonly utilized format that pose several challenges for modern language models.”

Bridging the Evaluation Gap

With recent advances in multi-modal models that combine visual encoders with large language models (LLMs), MT systems capable of handling visually and textually complex tasks like document translation are within reach.

Benchmarking model performance on challenging document understanding tasks that consider long-range contextual clues is increasingly important. 

“This dataset aims to bridge the evaluation gap in document-level NMT systems, acknowledging the challenges posed by rich text layouts in real-world applications,” the researchers said.

Unique Dataset

The dataset consists of over 200,000 document-level image-text pairs across 8 language pairs, making it the largest multimodal machine translation dataset to date. The image-text pairs were drawn from a variety of online sources, including news articles, blog posts, and educational materials, covering a wide range of domains and layout complexities.

Annotators labeled the documents with layout information. The documents were then machine-translated, and professional translators post-edited the output while taking the full document context into account.
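Based on the description above, each entry pairs a document image with layout-annotated source text and a document-level, post-edited reference translation. The following minimal Python sketch illustrates what such a record might look like; all field names, labels, and contents are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass


@dataclass
class LayoutRegion:
    """One annotated region of the document image."""
    label: str          # structural label, e.g. "header" or "paragraph"
    bbox: tuple         # bounding box in pixels: (x0, y0, x1, y1)
    source_text: str    # OCR'd text contained in this region


@dataclass
class DocumentPair:
    """A document-level image-text pair with its reference translation."""
    image_path: str
    source_lang: str
    target_lang: str
    regions: list                  # list of LayoutRegion in reading order
    reference_translation: str     # post-edited, document-level target text

    def source_document(self) -> str:
        """Concatenate region text in reading order, as a text-only MT baseline would see it."""
        return "\n".join(r.source_text for r in self.regions)


# Example record (contents invented for illustration)
doc = DocumentPair(
    image_path="docs/page_001.png",
    source_lang="en",
    target_lang="de",
    regions=[
        LayoutRegion("header", (0, 0, 600, 40), "Quarterly Report"),
        LayoutRegion("paragraph", (0, 60, 600, 400), "Revenue grew in Q2."),
    ],
    reference_translation="Quartalsbericht\nDer Umsatz wuchs im 2. Quartal.",
)
print(doc.source_document())
```

A layout-aware model would additionally consume the image and the region labels and boxes, while a text-only baseline sees only the concatenated `source_document()` string.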

“Our dataset is unique in that it focuses on machine translation at the document level and test models on both their ability to translate and their ability to use visual features as contextual clues,” they said.

The researchers evaluated several existing multimodal MT models, such as LLaVa, on the M3T dataset. They found that incorporating visual features improved the translation quality of OCR’d text. However, the improvements were not significant, indicating that the models struggled to effectively utilize the visual information, particularly at the document level.

According to the researchers, future research should explore and develop more effective methods to fully leverage the contextual information provided by visual elements. They concluded that “multi-modal document translation is still an area for future research.”

The dataset and scripts are available on GitHub.

Authors: Benjamin Hsu, Xiaoyu Liu, Huayang Li, Yoshinari Fujinuma, Maria Nadejde, Xing Niu, Yair Kittenplon, Ron Litman, and Raghavendra Pappagari