Large language models (LLMs) have already shown what they can do across a wide variety of tasks. The question driving this competitive upscaling: Will language models keep getting better as they grow bigger?
Here’s the LLM recipe if you want to take a stab at finding out.
To create an LLM following Google’s approach, you’ll need 780 billion language segments (tokens) from
- Social media conversations
- Filtered webpages
- Books
- GitHub code
- Wikipedia
- News
Combine at proportions of 50%, 27%, 13%, 5%, 4%, and 1%, respectively.
1. Get your equipment ready. Take 6,144 computer chips designed for machine learning.
2. Arrange chips in a beehive pattern to create a “pod.” Repeat.
3. Select and apply a Transformer learning model.
4. Add ingredients to the pods.
5. Process thoroughly. (Tip: Switch off other electrical appliances first.)
6. Continue to fold in ingredients, allowing parallel processing within and across pods.
7. Your Transformer will look closely at the ingredients, decide how they relate to each other, and then depict those relationships as numbers (parameters).
8. Stop when you have at least 540 billion parameters.
When you’re finished, plate! Your large language model is now ready to understand and generate language.
Scaling a model of PaLM’s size is no small feat. Google used its proprietary machine learning chips (TPUs) and connected them into supercomputers called pods. A single pod can deliver one exaflop of computing power, that is, a quintillion (10^18) calculations per second.
To put that in perspective, to match what a one-exaflop machine does in one second, you would need to perform one calculation per second for more than 31 billion years.
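The 31-billion-year figure can be checked with a few lines of arithmetic. The sketch below is only an illustration: it assumes a year of 365.25 days, and the names are made up for this example.

```python
# One exaflop is 10**18 floating-point operations per second.
EXAFLOP_OPS_PER_SECOND = 10**18

# Assumption: a year of 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def years_at_one_op_per_second(total_ops: int) -> float:
    """Years needed to perform total_ops at one operation per second."""
    return total_ops / SECONDS_PER_YEAR


years = years_at_one_op_per_second(EXAFLOP_OPS_PER_SECOND)
print(f"{years / 1e9:.1f} billion years")  # prints "31.7 billion years"
```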
Google researchers used the company’s now-default Transformer machine learning architecture along with Pathways, a new system that orchestrates the processing of data across multiple pods in parallel, resulting in more efficient training.
Google’s longer-term vision is to create large, multi-talented models that can easily switch between specialized tasks. Such a model could, for instance, play chess one moment and then diagnose a disease, predict how floods flow through terrain, or translate a poem the next.
So, how does PaLM perform? Google researchers tested the model on typical language tasks (e.g., question answering, filling missing words in sentences). They also wanted to explore the frontiers of what was possible, giving the model a set of 150 particularly challenging tasks known as BIG-bench (the Beyond the Imitation Game Benchmark), including:
- Distinguishing between a literal sentence and a metaphorical one;
- Distinguishing cause and effect;
- Determining if a text is intended to be a joke or not; and
- Finding statements that strengthen or weaken logical arguments.
Researchers focused on “few-shot learning,” that is, how well the model could perform on specific tasks even though it had not been trained with task-specific “ingredients.” Only a handful (“few”) of task-relevant examples (“shots”) were provided to PaLM.
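In practice, few-shot learning usually means placing the handful of examples directly in the text given to the model, followed by the new case. The sketch below shows one way such a prompt could be assembled. The `Input:`/`Label:` layout, the function name, and the example sentences are all invented for illustration; they are not PaLM’s or BIG-bench’s actual format.

```python
from typing import List, Tuple


def build_few_shot_prompt(examples: List[Tuple[str, str]], query: str) -> str:
    """Join a few labeled examples ("shots") and one unlabeled query into a prompt."""
    blocks = [f"Input: {text}\nLabel: {label}" for text, label in examples]
    # The final block leaves the label blank for the model to complete.
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


# Hypothetical shots for a literal-vs-metaphorical task.
shots = [
    ("Time is a thief.", "metaphorical"),
    ("The thief stole my watch.", "literal"),
    ("Her voice is music to my ears.", "metaphorical"),
]

prompt = build_few_shot_prompt(shots, "The music stopped at midnight.")
print(prompt)
```

The model’s weights are not updated here; the examples only shape what it is asked to continue.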
The result according to Google: On standard tasks, PaLM did better than prior large models in almost all cases. And on the difficult tasks, PaLM achieved breakthrough capabilities.
The model can, for example, find synonyms, distinguish cause and effect, explain jokes, and guess a movie title from an emoji description (use case: Deadpool billboard).
Google also reported advances in reasoning, both mathematical and common sense, helped along by a strategy called “chain of thought prompting,” in which the model is prompted to work through a problem in a series of short intermediate steps before giving its final answer.
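A chain-of-thought prompt typically shows the model a worked example whose answer spells out the reasoning, then poses a new question in the same layout. The sketch below is a minimal illustration; the `Q:`/`A:` layout, the function name, and the arithmetic problems are assumptions made up for this example, not prompts taken from Google’s paper.

```python
from typing import List, Tuple


def build_chain_of_thought_prompt(
    worked_examples: List[Tuple[str, str, str]], question: str
) -> str:
    """Build a prompt whose examples show reasoning steps before the answer."""
    parts = []
    for example_question, reasoning, answer in worked_examples:
        parts.append(f"Q: {example_question}\nA: {reasoning} The answer is {answer}.")
    # The new question ends with an open "A:" so the model writes its own steps.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical worked example with its intermediate steps written out.
examples = [
    (
        "A shop has 12 apples and sells 5. It then receives 8 more. "
        "How many apples does it have?",
        "The shop starts with 12 apples. After selling 5 it has 12 - 5 = 7. "
        "After receiving 8 more it has 7 + 8 = 15.",
        "15",
    ),
]

cot_prompt = build_chain_of_thought_prompt(
    examples,
    "A library has 30 books and lends out 12. It then buys 6 more. "
    "How many books does it have?",
)
print(cot_prompt)
```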
In solving many of these tasks, Google said, PaLM does better than the average person.
Moreover, when plotting PaLM’s performance against its scale, Google concluded that “performance improvements from scale have not yet plateaued.” In other words, language models will likely get better as they grow bigger.
Translation on the Side
Incidentally, PaLM is also quite good at machine translation and other multilingual tasks.
While most training data (78%) was in English (the balance comprising over 100 different languages), and parallel data was not explicitly included, PaLM matched the performance of specialized machine translation models — at least, for language pairs that included English.
In its paper, Google hinted at future models that will be trained on a larger proportion of multilingual data.
Reactions to PaLM — and to big tech’s announcements about LLMs in general — have been mixed.
Hyperia CEO, Elliot Turner, tweeted, “The new 540B parameter language model (PaLM) from Google is mind-blowing. It scores better than the ‘average human’ on 150 different cognitive tasks. If model scaling holds up, 50–100 trillion parameters (100x bigger than this model) should beat ‘best human’.”
YouTuber Yannic Kilcher was also enthusiastic, saying, “It’s fair to say that these models are becoming the Swiss Army Knife of natural language processing tasks.”
Others struck a wearier tone. “Anyone else feel burned out by a new AI breakthrough every week?” tweeted Soumith Chintala of Meta AI. “Trying to keep up but it goes by so fast.”
Intento Co-founder and CTO, Grigory Sapunov, told Slator that the large model trend is likely to continue for some time. “We see that increasing the model size gives more capabilities, some of which cannot be predicted in advance. So it’s worth probing the frontier to understand what is possible,” he added.
However, Sapunov and a number of pundits pointed out that size is not the single most important aspect of any model. Researchers on DeepMind’s Chinchilla, for instance, found that the relative size of the training dataset is just as important. ADAPT has also shown that good performance is possible with smaller, optimized models.
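One rough way to see the “relative size” point is to divide the amount of training data by the number of parameters. The sketch below does that under two assumptions: it treats the article’s 780 billion “language segments” as tokens, and it uses Chinchilla’s published figures (70 billion parameters, 1.4 trillion tokens), which come from DeepMind’s paper rather than from this article.

```python
def tokens_per_parameter(training_tokens: float, parameters: float) -> float:
    """Ratio of training tokens to model parameters."""
    return training_tokens / parameters


# PaLM figures as given in this article (segments treated as tokens).
palm_ratio = tokens_per_parameter(780e9, 540e9)

# Chinchilla figures as reported by DeepMind (an assumption here).
chinchilla_ratio = tokens_per_parameter(1.4e12, 70e9)

print(f"PaLM: {palm_ratio:.2f} tokens per parameter")              # 1.44
print(f"Chinchilla: {chinchilla_ratio:.2f} tokens per parameter")  # 20.00
```

By this measure the much smaller Chinchilla saw far more data per parameter than PaLM did.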
So are large models really necessary? Not according to Julien Simon of Hugging Face, an open-source NLP company. “Am I excited by Megatron-Turing NLG 530B and whatever beast is coming next? No. Do I think that the (relatively small) benchmark improvement is worth the added cost, complexity, and carbon footprint? No.”
The scientific community also expressed skepticism around Google’s decision not to publish its PaLM model or follow machine learning reproducibility checklists.
Robust.AI’s Gary Marcus tweeted, “This week seems like a win for AI, but it’s actually a step back. Sparse disclosure of methods and errors. Anecdotal data only. Cherry-picking. No access for scientific community.”
Neuroscientist Nathan Whitmore chimed in, “What is the point of developing these things if they’re never released in a way that can be widely used/studied?”
The question of how close LLMs are getting to human intelligence was unpacked by Melanie Mitchell, Professor at the Santa Fe Institute, in a Twitter thread. “Very impressive,” she said of the models; however, “I will not mistake them for progress toward general intelligence.”
Nevertheless, large language models can be useful as foundations for a range of language services, helping scale content tasks such as rewriting sentences with the correct tone, or generating useful summaries.
Startups CopyAI and OthersideAI use LLMs as a basis for their text-generation services. In this respect, Intento’s Sapunov pointed out, “The future is definitely bright and many new products and services will emerge.”