Real-Time Speech Translation Stars in Biggest OpenAI Release Since ChatGPT

The obligatory checkbox on OpenAI’s website, which requires visitors to verify that they are human, seems almost tongue-in-cheek following the company’s latest release, GPT-4o. (The “o” stands for “omni.”)

In a May 13, 2024 announcement, OpenAI described the newest version of its large language model as a “step towards much more natural human-computer interaction,” citing a range of new or improved capabilities, such as human-like response time in conversations and the interpretation of emotions through facial expressions. 

“With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network,” the press release explained.

TechCrunch reported that GPT-4o is now “more multilingual,” with OpenAI claiming “enhanced performance in around 50 languages.”

Indeed, OpenAI’s press release includes a bar graph comparing BLEU scores for audio translation by OpenAI and several competitors. GPT-4o, according to OpenAI, earned the highest BLEU score, with Gemini a very close second. The company also noted “significant improvement on text in non-English languages.”
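OpenAI’s comparison relies on BLEU, a standard machine-translation metric that scores the n-gram overlap between a system’s output and a reference translation, discounted by a brevity penalty. As a rough illustration of how the metric works (not OpenAI’s evaluation pipeline, and without the smoothing a production scorer would use), a minimal sentence-level BLEU can be sketched in Python:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of modified n-gram
    precisions (n = 1..max_n) times a brevity penalty."""
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped overlap: each hypothesis n-gram counts only up to
        # the number of times it appears in the reference.
        overlap = sum((hyp_counts & ref_counts).values())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:  # no smoothing in this sketch
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
```

A perfect match scores 1.0; published BLEU figures like those in OpenAI’s chart are typically corpus-level and scaled to 0–100, so the numbers are not directly comparable to this toy sentence-level version.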

[Image: Live Audio Translation Takes Center Stage in OpenAI GPT-4o Release (Source: OpenAI)]

Real-time (live) translation is a perennial favorite among professionals and laypeople alike, inviting the inevitable comparisons to the literary Babel fish, as well as waves of both praise and underwhelmed reactions.

“Nobody tell them Google translate’s [sic] been doing this for years,” one observer noted in a post on X. Others disagreed, commenting that GPT-4o is “easier to use” and “slightly faster [with] less friction.”

OpenAI’s demo showcases a brief exchange in which OpenAI CTO Mira Murati asks, in Italian, what whales with the power of speech might ask humans.

“They might ask, how do we solve linear equations?” her interlocutor responds in English, an apparent callback to content from earlier in the demo. Interestingly, GPT-4o broke a convention of contemporary interpreting by speaking in the third person, rather than in the first person (with no noticeable effect on participants’ comprehension) — a fact noted on X by commentators pushing back on the inevitable “RIP translators” hot takes.

Beyond translation and interpreting, many observers also pointed out that shares in language learning app Duolingo dropped 3% during the OpenAI announcement.

Dr. Jim Fan, a Senior Research Manager at NVIDIA, described the generated voice as “lively and even a bit flirty. GPT-4o is trying (perhaps a bit too hard) to sound like HER,” referring to the 2013 film in which a man falls in love with an AI virtual assistant voiced by Scarlett Johansson. 

“It’s a pivot towards more emotional AI with strong personality, which OpenAI seemed to actively suppress in the past,” Fan concluded.

OpenAI began to roll out GPT-4o’s text and image capabilities on May 13, 2024. According to the press release, “We plan to launch support for GPT-4o’s new audio and video capabilities to a small group of trusted partners in the API in the coming weeks.”