Salesforce Just Open-Sourced a Large, XML-Tagged Machine Translation Dataset

July 29, 2020

by Seyma Albarino

Training neural machine translation (NMT) engines on text that retains its XML tags can improve translation accuracy, according to a June 2020 paper published by a team of researchers at Salesforce.

As part of this research, the team, which includes language industry veteran Teresa Marshall, Vice President of Globalization and Localization at Salesforce, made available on GitHub a dataset that draws on the software company’s professionally translated online help documentation.

The entire dataset covers 17 languages — any of which can be used as a source or target language — and includes about 7,000 pairs of XML files for each language pair.

“Our work is unique in that we focus on how to translate text with XML tags, which is practically important in localization,” lead researcher Kazuma Hashimoto told Slator.

A new dataset was necessary for the team’s research, as widely used datasets of plain text do not reflect the fact that “text data on the Web is often wrapped with markup languages to incorporate document structure and metadata such as formatting information,” the researchers explained.

“We decided to publish our new dataset so that people can use it if interested, and we can also gain significant benefit if they report interesting solutions to our task,” Hashimoto said, pointing out that the source data, online help for Salesforce customers, was already publicly available.

Looking ahead, the team wrote, “As our dataset represents a single, well-defined domain, it can also serve as a corpus for domain adaptation research (either as a source or target domain).”

Including XML Tags Improves Quality

According to the paper, this online help text has been localized and maintained for 15 years by the same localization service provider and in-house localization program managers. 

“At every release, we run our system to translate the content in English to other target languages, and then human experts verify the quality and perform post-editing to meet the quality demand,” Hashimoto said.

Drawing on this multilingual content, the researchers created datasets for seven English-based language pairs (English to Dutch, Finnish, French, German, Japanese, Russian, and Simplified Chinese) and one non-English pair, Finnish to Japanese.

The group performed baseline experiments on NMT output with XML tags removed (i.e., plain text) and compared them to experiments on NMT output with XML tags included. 

The team trained three models for each language pair: one trained only with text, without XML; one trained with XML; and one trained with XML and with copy mechanisms, which copy XML elements from the original source text. 
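
The difference between the first two setups comes down to how each segment is preprocessed before training. The following is a minimal sketch of that step, assuming inline markup looks like ordinary XML elements. The helper name `strip_tags`, the regular expression, and the sample sentence are illustrative assumptions and are not taken from the Salesforce dataset or the researchers' code.

```python
import re

# Assumption: inline markup consists of ordinary XML tags such as
# <b>, </b>, or <br/>. The pattern matches any one such tag.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")


def strip_tags(segment: str) -> str:
    """Return the plain-text variant of a tagged segment (hypothetical helper)."""
    text = TAG_PATTERN.sub("", segment)
    # Collapse whitespace runs that removing tags may leave behind.
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample segment, not drawn from the actual dataset.
tagged = 'Click <b>Save</b>, then open the <a href="#setup">Setup</a> page.'
plain = strip_tags(tagged)

print(tagged)  # what a model trained with XML would see
print(plain)   # what a text-only model would see
```

A text-only model would be trained on segments like `plain`, while the XML-aware models would be trained on segments like `tagged`, with the tags kept as part of the input.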

For the plain text NMT, “including segment-internal XML tags tends to improve the BLEU scores,” the authors wrote, which “is not surprising because the XML tags provide information about explicit or implicit alignments of phrases.” This was not the case, however, for English to Finnish, “which indicates that for some languages it is not easy to handle tags within the text.” 

The model trained with both XML and copy mechanisms achieved the best BLEU scores for both plain text and text with XML tags across all language pairs, except for English to French plain text.

“We expected that tagged text would be helpful in improving translation accuracy,” Hashimoto said, “especially when the training dataset size is limited, as in our specific use case, compared with very general machine translation work in existing research papers.”

The researchers also encountered a typical NMT error, undertranslation: the underlined phrase “for example” was missing from certain translation results, even though BLEU scores on the dataset were higher than those on other, standard public datasets. For this reason, and because online help translations must be accurate, the authors concluded that NMT should be used “for the purpose of helping the human translators” perfect final translations.
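
When a dropped phrase was wrapped in a tag, as with the underlined phrase here, one simple way to flag it is to compare the tags in the source against those in the MT output. The sketch below is a hypothetical check, not the evaluation method used in the paper. The function name `missing_tags`, the regular expression, and both sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: an opening (or self-closing) tag starts with "<" followed by
# a letter. Closing tags such as </u> are deliberately not matched.
OPEN_TAG = re.compile(r"<([A-Za-z][\w.-]*)[^<>]*>")


def missing_tags(source: str, translation: str) -> Counter:
    """Count tags that appear in the source but not in the translation."""
    source_tags = Counter(OPEN_TAG.findall(source))
    output_tags = Counter(OPEN_TAG.findall(translation))
    # Counter subtraction keeps only positive differences.
    return source_tags - output_tags


# Made-up segment pair in which the underlined phrase was dropped.
source = "You can, <u>for example</u>, export the <b>report</b>."
output = "Vous pouvez exporter le <b>rapport</b>."

print(missing_tags(source, output))
```

A non-empty result only signals that a tagged span may have been dropped. A human reviewer would still need to confirm whether the translation is actually incomplete.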

Although human evaluators identified more than 50% of the translation results as “complete” or “useful in post-editing,” translators still spent a significant amount of time verifying MT and correcting MT errors.

Ideally, future translation models that take into account Web-structured text “may help human translators accelerate the localization process,” according to the paper’s authors, whose future work will explore “the effectiveness of using the NMT models in the real-world localization process where a translation memory is available.”

By Seyma Albarino

Staff Writer at Slator. Linguist, music blogger and reader of all things dystopian. Based in Chicago after adventures on three continents.
