Several of Google’s efforts on the translation front have come to fruition in a flurry of activity halfway through May 2022 — just in time for the company’s annual I/O conference aimed at developers.
Just 24 hours ago, Google unveiled a prototype for smart glasses that are designed to transcribe and translate speech and display the resulting text on the lenses in real time.
And, earlier in the week, Google Translate finally followed through on a 2020 heads-up about a “save to account” feature that stores translation histories in users’ accounts. While a new prompt informs users they can back up their Translate history to their Google accounts, users can still opt to access Google Translate without an account.
Moving from form to function: A May 11, 2022 product update on the Google blog said Google Translate has added 24 languages to its repertoire, bringing the total number of supported languages to 133.
Of the 24 new languages (spoken by more than 300 million people worldwide), eight are spoken in India; the others are from countries across Africa, South America, and Asia. The roster to be added to Google Translate includes the first English dialect as well as languages indigenous to the Americas.
These are the first languages added to Google Translate using zero-shot machine translation (MT). Google explored this method in a January 2022 paper on using “massively multilingual” MT to translate more than 200 languages, comparing the performance of a model trained without parallel data (i.e., on monolingual text only) to a multilingual baseline.
At the time, Google did not indicate a timeline for incorporating this method into the user-facing Google Translate — but here it is.
While Google may be productizing zero-shot MT earlier than some expected, research on the method continues behind the scenes.
A May 2022 paper, “Building Machine Translation Systems for the Next Thousand Languages,” detailed efforts to create datasets for more than 1,500 languages, develop MT for those languages, analyze system outputs, and identify frequent errors.
The authors (all affiliated with Google Research) acknowledged that Google Translate’s limited menu of languages has historically skewed “European,” despite the large speaker populations of languages spoken in Africa and in South and Southeast Asia, as well as of the indigenous languages of the Americas.
Researchers sidestepped the lack of parallel data for these languages by gathering monolingual web text, which they then used to build a multilingual unlabeled text dataset containing more than one million sentences.
They used this dataset and a parallel corpus spanning 112 languages to build massively multilingual models “capable of translating across 1,000 languages” — noting that the inclusion of more languages, large-scale back-translation, and self-training contributed to significant quality improvements among zero-resource languages.
Isaac Caswell, Senior Software Engineer at Google Translate and co-author of the May 2022 paper, wrote in Google’s product update: “While this technology is impressive, it isn’t perfect. And we’ll keep improving these models to deliver the same experience you’re used to with a Spanish or German translation, for example.”
Not a Translation Quality Error
Unfortunately for Google, a number of observers expressed skepticism about the quality of translation for those lower-resource languages because of a marketing error.
During the I/O keynote by Sundar Pichai, CEO of Google and Alphabet, a backdrop was meant to display the names of the 24 languages newly added to Google Translate, each in its own script. But native speakers said the text was riddled with errors.
“Congrats to @Google for getting Arabic script backwards & disconnected,” Rami Ismail tweeted on May 11, 2022, “because small independent startups like Google can’t afford to hire anyone with a 4 year olds’ [sic] elementary school level knowledge of Arabic writing.”
Even news website TechCrunch got in on the action: “Erm, Google, that’s not how you write Arabic.”
“Or Urdu…it’s just a bunch of letters jumbled up,” another Twitter user chimed in. Sam Ettinger summed up the issue: “Every single one that’s not Latin- or Cyrillic-based is wrong (at least a little bit).”
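For readers unfamiliar with the failure mode, the “backwards” symptom in the tweets above is what generally happens when right-to-left text is laid out by software that ignores character direction. The Python sketch below is only an illustration under stated assumptions: the helper names `needs_rtl_layout` and `left_to_right_slots` are hypothetical, the code treats the whole string as a single right-to-left run rather than applying the full Unicode Bidirectional Algorithm, and it does not model contextual letter joining (the “disconnected” part). It says nothing about how Google’s slide was actually produced.

```python
import unicodedata

# Unicode bidirectional classes assigned to right-to-left letters
# ("R" covers Hebrew-style letters, "AL" covers Arabic-script letters).
RTL_CLASSES = {"R", "AL"}


def needs_rtl_layout(text: str) -> bool:
    """Return True if any character belongs to a right-to-left bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def left_to_right_slots(text: str, respect_direction: bool) -> list:
    """List characters in the order they occupy screen slots, left to right.

    Simplifying assumption: the whole string is one right-to-left run.
    """
    chars = list(text)
    if respect_direction and needs_rtl_layout(text):
        chars.reverse()  # first logical letter ends up in the rightmost slot
    return chars


# The Arabic word for "Arabic", stored in logical (reading) order.
word = "\u0639\u0631\u0628\u064a"

correct = left_to_right_slots(word, respect_direction=True)
broken = left_to_right_slots(word, respect_direction=False)

# Laid out correctly, the first letter read sits in the rightmost slot.
print(correct[-1] == word[0])
# A direction-blind renderer puts that same letter in the leftmost slot.
print(broken[0] == word[0])
```

A real text stack delegates both steps, reordering and glyph shaping, to dedicated layout libraries; when text is flattened into an image or pasted through a tool that skips them, the letters can come out in the wrong order and in their unjoined forms.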
One user suggested the problems arose from differences between text on screen and in print: “It has been this way since the very first version of Google Docs. The font looks fine on screen, but the ‘print’ version is effed. The indirect solution was the release of a new font.”
The explanation: When a document set in a default font is “printed,” whether to a PDF or to a slide in “present” mode, Google creates an image from that typeface. Hence, the consonant stacks and vowels appear shifted.
Despite the difficulties, even critics remain hopeful of future refinement. “The Translate team is awesome, I am genuinely moved to tears by how much work they do in so many languages,” Ettinger tweeted after pointing out issues with the text.
He added, “Seeing the list grow makes me remember scanning through it as a kid, awestruck to just learn these other languages existed, and I hope more people experience that.”