Slator recently covered the European Union’s latest attempt at machine translation: Translation for Massive Open Online Courses (TraMOOC). As explained in the article: TraMOOC “aims to leverage an online translation platform utilizing a wide array of linguistic infrastructure tools to reliably translate multimedia data of MOOCs from English into eleven other languages: German, Italian, Portuguese, Greek, Dutch, Bulgarian, Czech, Croatian, Polish, Russian and Chinese.”
Slator caught up with Dr. Joss Moorkens, one of the people helping with the effort, to find out more about the translation project, to which the EU’s Horizon 2020 program awarded 3 million euros. Dr. Moorkens is a post-doctoral researcher at the ADAPT Centre in the School of Computing and a lecturer in Multimedia Translation at the School of Applied Language and Intercultural Studies at Dublin City University. He clarified that the 3 million euro investment was not only for the machine translation (MT) engine that would underlie the automated translation, but for the entire project, which would eventually also incorporate human translation via post-editing.
“We’re not really creating new MT technologies” per se, Dr. Moorkens said. They will, of course, be building new engines just for TraMOOC, using the Moses statistical MT system. “What is novel about this is it will be able to plug into MOOC platforms” and perform automated translation fairly well by the time the project ends, Dr. Moorkens said. He also explained that they are not using any existing commercial MT solution because they need a domain-specific one, built specifically for MOOC platforms. What they are doing is collecting parallel texts from various sources and integrating them into MT engines for each language pair, which they will then customize for use with MOOCs. A certain amount of the project, though, is handled professionally via partners like Deluxe Media.
“Aside from existing parallel texts from MOOCs and online courses, most of the data is out of domain,” Dr. Moorkens said. “DGT data, TED data; for Chinese news reports, the FBIS (Foreign Broadcast Information Service) data has been used for a lot of Chinese MT,” he elaborated, identifying some of the data sources they will be incorporating into the TraMOOC MT engine. What they intend to do, according to Dr. Moorkens, is “use in- and out-of-domain data and make it domain specific for TraMOOC… [and] once the MT engine’s built, tune it to translate” domain-specific language.
Finally, at much later stages of the project, they are also looking to use crowdsourcing and post-editing, though at this stage they have yet to discuss how. The end goal, to Dr. Moorkens’ knowledge, is to commercialize the project upon completion.
He also referenced another project he has been working on for about a year: the Falcon Project, which is nearing its final phases and commercialization by industry partners. It is a web-based localization platform that brings proprietary technology from industry partners together with incrementally retrained statistical machine translation for deployment. In the Falcon Project’s case, they incorporated a computer-assisted translation (CAT) interface from XTM, Easyling’s website translation proxy, and Interverbum’s TermWeb terminology management suite on top of retrained and tuned statistical MT.