In a paper published on December 18, 2023, researchers Syeda Nahida Akter, Zichun Yu, Aashiq Muhamed, Tianyue Ou, Alex Bäuerle, Ángel Alexander Cabrera, Krish Dholakia, Chenyan Xiong, and Graham Neubig from Carnegie Mellon University and BerriAI explored the translation abilities of Google’s Gemini, highlighting it as a “valuable tool.”
The researchers explained that the recently introduced Google Gemini models are the first to comprehensively report results rivaling the OpenAI GPT series across diverse tasks. However, a significant drawback is noted: the absence of released evaluation details and model predictions. “The exact evaluation details and model predictions have not been released, limiting the ability to reproduce, inspect, and analyze the results and their implications in detail,” they said.
To address this, the researchers conducted a “third-party, objective comparison” between OpenAI GPT and Google Gemini models, providing “reproducible code and fully transparent results.” Beyond translation, the evaluation covered other tasks, such as reasoning, knowledge-based question answering, math problem solving, code generation, and instruction following.
The researchers compared Gemini Pro, GPT-3.5 Turbo, and GPT-4 Turbo against established systems like Google Translate and benchmarked them against NLLB-MoE, an open-source machine translation (MT) model known for its extensive language coverage.
These models were evaluated across 20 languages with various levels of resource availability and translation difficulty, looking particularly at how well the models performed with translations from English to other languages (ENG→X). To evaluate the outputs, the researchers used standard metrics, such as BLEU and chrF2++.
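For readers unfamiliar with character-level metrics, the following is a minimal sketch of the idea behind chrF: an F-score over character n-gram overlap that weights recall more heavily than precision (beta = 2). It is a simplified illustration, not the researchers' evaluation code. The function names are hypothetical, and the sketch omits the word n-gram component that the "++" variant adds. Reproducing published scores would call for an established implementation such as sacreBLEU.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after dropping all whitespace (a simplification)."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 100] for one hypothesis/reference pair."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            # Skip n-gram orders that either string is too short to produce.
            continue
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # F-beta: beta = 2 makes recall count four times as much as precision.
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

An identical hypothesis and reference score 100, strings with no characters in common score 0, and an empty (for example, blocked) response also scores 0 under this sketch.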
A Valuable Tool
Google Translate generally outperformed the other systems, achieving the best results in 10 languages. The language models were competitive but still fell short when translating from English into other languages.
GPT-4 Turbo's performance pattern differed from that of GPT-3.5 Turbo and Gemini Pro. Notably, GPT-4 Turbo demonstrated larger improvements for low-resource languages, whereas performance was similar among the large language models (LLMs) for high-resource languages.
Gemini Pro outperformed both GPT-3.5 Turbo and GPT-4 Turbo in five out of 20 languages, achieving top performance in three languages. However, it tended to block responses when its confidence was lower, which happened in approximately 10 language pairs. The researchers attributed Gemini Pro’s lower performance in some languages to this tendency.
Graham Neubig (@gneubig), December 19, 2023, on translation (FLORES):
* Gemini achieves low scores in 12 languages because it fails to respond at all
* But when it responds, Gemini is good at translation! It outperforms GPT-4 and GPT-3.5 Turbo
A closer examination revealed that Gemini Pro marginally outperformed GPT-3.5 Turbo and GPT-4 Turbo on unblocked samples, where it demonstrated higher confidence. Specifically, it surpassed GPT-4 Turbo by 1.6 chrF in the 5-shot setting and 2.6 chrF in the 0-shot setting, and exceeded GPT-3.5 Turbo by 2.7 chrF and 2 chrF in the 5-shot and 0-shot settings, respectively.
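The comparison on unblocked samples amounts to restricting every system to the same subset of examples before averaging, so that no system is penalized for responses that were never produced. The sketch below illustrates that step under stated assumptions: the function name, the system labels, and all scores are hypothetical placeholders, not values from the study.

```python
from statistics import mean


def mean_on_unblocked(scores_by_system: dict[str, list[float]],
                      blocked_flags: list[bool]) -> dict[str, float]:
    """Average each system's per-sample scores over samples that were not blocked.

    blocked_flags[i] is True when the system under study returned no response
    for sample i; those samples are dropped for every system.
    """
    keep = [i for i, blocked in enumerate(blocked_flags) if not blocked]
    if not keep:
        raise ValueError("every sample was blocked; nothing to compare")
    return {
        name: mean(scores[i] for i in keep)
        for name, scores in scores_by_system.items()
    }


# Made-up per-sample scores for two hypothetical systems.
toy_scores = {
    "system_a": [50.0, 0.0, 60.0],   # sample 1 was blocked, so it scored 0
    "system_b": [48.0, 52.0, 58.0],
}
toy_blocked = [False, True, False]

unblocked_means = mean_on_unblocked(toy_scores, toy_blocked)
```

In this toy data, system_a trails system_b when all three samples are averaged, but leads once the blocked sample is excluded for both. That is the kind of reversal the unblocked-sample analysis is designed to surface.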
Despite the observed challenges in translating certain samples, the authors emphasized Gemini Pro’s competitive performance against other models on Cyrillic scripts, in contrast to its underperformance on other scripts. GPT-4 stood out, outperforming both Gemini Pro and GPT-3.5 Turbo across various scripts, and it was particularly effective in languages using the Devanagari script.
The authors concluded with a recommendation for researchers and practitioners to consider Gemini Pro as a “valuable tool in their toolkit, comparable to GPT-3.5 Turbo.”
Despite acknowledged limitations, the study provided a transparent and reproducible analysis, inviting the community to explore and scrutinize the findings. For those interested in reproducing the results, the code and data can be found at https://github.com/neulab/gemini-benchmark.