Salesforce Just Open-Sourced a Large, XML-Tagged Machine Translation Dataset

Salesforce Machine Translation Research on XML-tagged online help content

Training neural machine translation (NMT) engines on text that retains its XML tags can improve translation accuracy, according to a June 2020 paper published by a team of researchers at Salesforce.

As part of this research, the team, which includes language industry veteran Teresa Marshall, Vice President of Globalization and Localization at Salesforce, made available on GitHub a dataset that draws on the software company’s professionally translated online help documentation.

The entire dataset covers 17 languages — any of which can be used as a source or target language — and includes about 7,000 pairs of XML files for each language pair.

“Our work is unique in that we focus on how to translate text with XML tags, which is practically important in localization,” lead researcher Kazuma Hashimoto told Slator.

A new dataset was necessary for the team’s research, as widely used datasets of plain text do not reflect the fact that “text data on the Web is often wrapped with markup languages to incorporate document structure and metadata such as formatting information,” the researchers explained.
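
To make the distinction concrete, here is a minimal sketch in Python, using an invented help segment (the dataset’s actual markup may differ): stripping the tags yields the kind of plain text found in standard MT corpora, but discards the formatting and linking information the study targets.

```python
# Minimal sketch (not from the paper): a hypothetical XML-tagged help segment
# versus the plain text most public MT corpora contain.
import xml.etree.ElementTree as ET

tagged_segment = (
    '<p>Click <ui>Save</ui> to store the record. '
    'See <xref href="records.htm">Managing Records</xref> for details.</p>'
)

# Stripping the markup yields plain text, but loses the structure that the
# translated document must preserve.
plain_text = "".join(ET.fromstring(tagged_segment).itertext())
print(plain_text)
# -> Click Save to store the record. See Managing Records for details.
```
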

“We decided to publish our new dataset so that people can use it if interested, and we can also gain significant benefit if they report interesting solutions to our task,” Hashimoto said, pointing out that the source data, online help for Salesforce customers, was already publicly available.

Looking ahead, the team wrote, “As our dataset represents a single, well-defined domain, it can also serve as a corpus for domain adaptation research (either as a source or target domain).”

Including XML Tags Improves Quality

According to the paper, this online help text has been localized and maintained for 15 years by the same localization service provider and in-house localization program managers. 

“At every release, we run our system to translate the content in English to other target languages, and then human experts verify the quality and perform post-editing to meet the quality demand,” Hashimoto said.

Drawing on this multilingual content, the researchers created datasets for seven English-based language pairs (English to Dutch, Finnish, French, German, Japanese, Russian, and Simplified Chinese) and one non-English pair, Finnish to Japanese.

The group performed baseline experiments on NMT output with XML tags removed (i.e., plain text) and compared them to experiments on NMT output with XML tags included. 

The team trained three models for each language pair: one trained only with text, without XML; one trained with XML; and one trained with XML and with copy mechanisms, which copy XML elements from the original source text. 
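
As a rough illustration of what the first two training settings could look like on the data side, the sketch below prepares a segment either as plain tokens or with its XML tags kept as atomic tokens. The segment, tokenizer, and function names are assumptions for illustration; the paper’s actual preprocessing and its copy-mechanism architecture are not reproduced here.

```python
# Hedged sketch of two training-data variants: text only vs. text with XML tags.
import re
import xml.etree.ElementTree as ET

def plain_tokens(segment: str) -> list[str]:
    """Drop all markup: the text-only baseline."""
    return "".join(ET.fromstring(segment).itertext()).split()

def with_xml_tokens(segment: str) -> list[str]:
    """Keep each XML tag in the token stream as a single special token."""
    pieces = re.split(r"(<[^>]+>)", segment)  # capture tags as separate pieces
    tokens = []
    for piece in pieces:
        if piece.startswith("<"):
            tokens.append(piece)          # tag kept verbatim
        else:
            tokens.extend(piece.split())  # naive whitespace tokenization
    return tokens

seg = '<p>Click <ui>Save</ui> to continue.</p>'
print(plain_tokens(seg))     # ['Click', 'Save', 'to', 'continue.']
print(with_xml_tokens(seg))  # ['<p>', 'Click', '<ui>', 'Save', '</ui>', 'to', 'continue.', '</p>']
```
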

When output was evaluated as plain text, “including segment-internal XML tags tends to improve the BLEU scores,” the authors wrote, which “is not surprising because the XML tags provide information about explicit or implicit alignments of phrases.” This was not the case, however, for English to Finnish, “which indicates that for some languages it is not easy to handle tags within the text.”

Similarly, the model trained with both XML and copy mechanisms achieved the best BLEU scores for both plain text and text with XML tags across all language pairs, except for English to French plain text.
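
Corpus-level BLEU comparisons of this kind can be reproduced with standard tooling. The following is a minimal sketch using the sacrebleu library; the segments are invented placeholders, not the paper’s data, and the authors’ exact evaluation setup is not reproduced here.

```python
# Hedged sketch: compare corpus BLEU for two hypothetical systems.
import sacrebleu

refs = ["Click Save to store the record."]    # reference translations (placeholder)
hyp_plain = ["Click Save to store record."]   # output of a plain-text-trained model
hyp_xml = ["Click Save to store the record."] # output of an XML-trained model

bleu_plain = sacrebleu.corpus_bleu(hyp_plain, [refs])
bleu_xml = sacrebleu.corpus_bleu(hyp_xml, [refs])
print(f"plain-text model BLEU: {bleu_plain.score:.1f}")
print(f"XML-trained model BLEU: {bleu_xml.score:.1f}")
```
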

“We expected that tagged text would be helpful in improving translation accuracy,” Hashimoto said, “especially when the training dataset size is limited, as in our specific use case, compared with very general machine translation work in existing research papers.”

The researchers also encountered a typical error, undertranslation: the phrase “for example” was missing from certain translation results, even though the dataset’s BLEU scores were higher than those of other, standard public datasets. For this reason, and because online help translations must be accurate, the authors concluded that NMT should be used “for the purpose of helping the human translators” perfect final translations.
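
As a rough illustration of how MT output might be routed to translators for this kind of check, the sketch below flags segments that drop XML elements or come out much shorter than the source. The heuristic, threshold, and example segments are assumptions for illustration, not the authors’ method.

```python
# Hedged sketch of a simple post-editing aid: flag likely undertranslations.
import re

def needs_review(source: str, mt_output: str, min_ratio: float = 0.6) -> bool:
    """Heuristic check; the threshold is illustrative, not tuned."""
    src_tags = re.findall(r"<[^>]+>", source)
    out_tags = re.findall(r"<[^>]+>", mt_output)
    if sorted(src_tags) != sorted(out_tags):
        return True  # XML elements were dropped, added, or altered
    src_len = len(re.sub(r"<[^>]+>", "", source).split())
    out_len = len(re.sub(r"<[^>]+>", "", mt_output).split())
    return out_len < min_ratio * src_len  # output much shorter than source

print(needs_review(
    "<p>Add a filter, for example a filter that shows only your region.</p>",
    "<p>Add a filter.</p>",
))  # -> True: the output is far shorter than the source
```
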

Although human evaluators identified more than 50% of the translation results as “complete” or “useful in post-editing,” translators still spent a significant amount of time verifying MT and correcting MT errors.

Ideally, future translation models that take into account Web-structured text “may help human translators accelerate the localization process,” according to the paper’s authors, whose future work will explore “the effectiveness of using the NMT models in the real-world localization process where a translation memory is available.”