Large language models (LLMs) have been described as “black boxes” — researchers know what they put in and what comes out, but what happens in between remains a bit of a mystery.
Now, a new paper from Google tries to peek inside the black box of the Pathways Language Model (PaLM) to understand how an LLM’s exposure to bilingual signals can help it translate.
Eleftheria Briakou, a PhD candidate at the University of Maryland, is the lead author of the May 17, 2023 paper, “Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM’s Translation Capability,” with co-authors Colin Cherry and George Foster, both of Google.
While supervised models are explicitly and intentionally exposed to translation data, LLMs are able to perform zero-shot translation; that is, they can translate without ever being explicitly trained to do so.
PaLM is a 540bn-parameter Transformer model with a pretraining dataset of 780bn tokens composed of multilingual sources: “social media conversations (50%), filtered webpages (27%), Wikipedia (4%), presumably English sources like books (13%) and news articles (1%), and source code (5%).”
First, the researchers scanned the training data and tagged instances containing bilingual text (with one of the two languages being English).
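A minimal sketch of how such flagging can work is shown below; it is an illustration, not the paper's actual detection pipeline. It uses the open-source langdetect library for per-sentence language identification and a naive line-based split, both of which are assumptions made for this example.

```python
# Sketch: flag an instance as bilingual if English co-occurs with exactly one
# other language. Not the authors' pipeline; langdetect and the line-based
# split are illustrative choices.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException


def find_bilingual_language(instance: str):
    """Return the non-English language code if the instance mixes English
    with exactly one other language, else None."""
    languages = set()
    for line in instance.splitlines():
        line = line.strip()
        if len(line) < 20:  # skip fragments too short for reliable language ID
            continue
        try:
            languages.add(detect(line))
        except LangDetectException:
            continue  # undetectable text (e.g., numbers, markup)
    if "en" in languages and len(languages) == 2:
        return (languages - {"en"}).pop()
    return None


# A webpage mixing English and French text would be tagged as bilingual ("fr").
example = (
    "The cat sat quietly on the warm windowsill all afternoon.\n"
    "Le chat était assis tranquillement sur le rebord de la fenêtre."
)
print(find_bilingual_language(example))
```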
They found that 55% of bilingual instances were not translations at all: code-switching, especially common on social media (10%); references to named entities in their native language (21%); and unrelated content in two languages juxtaposed on the same webpage (24%).
Another 40% of bilingual instances fell under a loose categorization of translations, either typical translations (20%) or content that was semantically related but not an exact translation (20%), such as summaries and paraphrases.
“We also spotted a few cases of forum discussions around explanations of translation or stylistic manipulation of translations,” the authors noted.
Ultimately, the researchers found that 1.4% of PaLM’s natural-language training instances were bilingual, and 0.34% contained at least one translated sentence pair into or out of English.
The number of monolingual instances in each language correlated with the number of instances containing bilingual or translated content for that language.
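The article does not reproduce the underlying counts, but the kind of relationship described here can be checked with a simple log-log correlation, as in the sketch below. The numbers are invented placeholders for illustration, not figures from the paper.

```python
# Illustrative check: correlate per-language monolingual counts with bilingual
# counts on a log scale, since language sizes span several orders of magnitude.
# The counts below are made-up placeholders, not data from the paper.
import numpy as np

monolingual_counts = np.array([5_000_000, 800_000, 120_000, 30_000, 4_000])
bilingual_counts = np.array([90_000, 20_000, 2_500, 700, 90])

# Pearson correlation of log counts; a value near 1 means languages with more
# monolingual data also tend to have more bilingual or translated content.
r = np.corrcoef(np.log(monolingual_counts), np.log(bilingual_counts))[0, 1]
print(f"log-log correlation: {r:.2f}")
```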
The authors pointed out that the number of translated pairs is a “lower bound on the amount of incidental bilingualism and translation that PaLM consumes, as we are restricted to a specific set of language pairs, and we only study bilingualism with English.”
PaLM’s Preferred Prompts
Of the 44 languages the group identified that also have FLORES-101 evaluation data, four were high-resource, 11 were medium-resource, and the remaining 29 were low-resource.
Researchers then explored the most effective natural prompts to trigger PaLM’s translation abilities.
According to the authors, the default prompt used by most machine translation research with LLMs consists of source and target language names in English, followed by a colon (e.g., “French:”). Empirically, this was also the most frequent translation prompt found in the data.
Alternative prompts include using ISO language codes for the source and target languages (e.g., “FR:”); source and target language names in their respective languages (e.g., “Français”); and the name of the source language in English plus the word “translation” in the target language (e.g., “Traduction:”).
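To make the formats concrete, the sketch below instantiates each of the four prompt types for an English-to-French request. The exact template layout (source line followed by a target cue on the next line) is an assumption for illustration; only the four naming schemes come from the paper.

```python
# The four prompt formats described above, instantiated for English -> French.
# The layout is assumed for illustration; the article only specifies how the
# source and target cues are named.
source_text = "The cat is sitting on the mat."

prompts = {
    # Default: source and target language names in English, each followed by a colon.
    "english_names": f"English: {source_text}\nFrench:",
    # ISO language codes for the source and target languages.
    "iso_codes": f"EN: {source_text}\nFR:",
    # Source and target language names written in their respective languages.
    "native_names": f"English: {source_text}\nFrançais:",
    # Source language name in English plus the word "translation" in the target language.
    "translation_word": f"English: {source_text}\nTraduction:",
}

for name, prompt in prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```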
“Interestingly, prompt types are not evenly distributed across our language groups,” they wrote. “Language codes appear primarily with high-resource languages, while low-resource languages favor prompts written in their native language.”
While the analysis quantified bilingualism only for a limited set of language pairs, all of which included English, the authors concluded that data-driven prompts mined from incidental translation content can improve zero-shot translation into English.