How Large Language Models Prove Chomsky Wrong with Steven Piantadosi

SlatorPod #167 - Steven Piantadosi on How Large Language Models Prove Chomsky Wrong

Joining SlatorPod this week is Steven Piantadosi, Associate Professor of Psychology at UC Berkeley. Steven also runs the Computation and Language Lab (colala) at UC Berkeley, which studies the basic computational processes involved in human language and cognition.

Steven talks about the emergence of large language models (LLMs) and how it has reshaped our understanding of language processing and language acquisition.

Steven breaks down his March 2023 paper, “Modern language models refute Chomsky’s approach to language”. He argues that LLMs demonstrate a wide range of powerful language abilities and disprove foundational assumptions underpinning Noam Chomsky’s theories and, as a consequence, negate parts of modern Linguistics.

Steven shares how he prompted ChatGPT to generate coherent and sensible responses that go beyond its training data, showcasing its ability to produce creative outputs. While critics argue that it is merely an endless sequence of predicting the next token, Steven explains how the process allows the models to discover insights about language and potentially the world itself.

Subscribe on YouTube, Apple Podcasts, Spotify, Google Podcasts, and elsewhere

Steven acknowledges that LLMs operate differently from humans, as models excel at language generation but lack certain human modes of reasoning when it comes to complex questions or scenarios. He unpacks the BabyLM Challenge, which explores whether models can be trained on human-sized amounts of data and still learn syntax or other linguistic aspects effectively.

Despite industry advancements and the trillion-dollar market opportunity, Steven agrees with Chomsky’s ethical concerns, including issues such as the presence of harmful content, misinformation, and potential job displacement.

Steven remains enthusiastic about the potential of LLMs and believes the recent advancements are a step forward to achieving artificial general intelligence, but refrains from making any concrete predictions.


Florian: First, let’s start with a bit of scene setting introduction, so tell us more about your lab. You run a lab called Colala, Computation and Language Lab and maybe expand a bit about your role at Berkeley and how it intersects with language?

Steven: My lab is really interested in two main topics. The first is language, so language processing and language acquisition. Specifically, we’re interested in how kids take the kind of input that they receive and learn the kinds of abstract rules and structures and concepts and things that you need for language. So we’re trying to develop formal computational theories of how that process can happen and that’s very exciting to us because it’s very interdisciplinary, so it draws on linguistics and computer science and neuroscience and experimental psychology. So we’re trying to put all of that together into some kind of picture that can explain kind of early language and early concepts. The second topic we study is numerical cognition, which I actually got into through language, so kids’ learning of words like one, two, three, four, learning of a counting system or arithmetic and we do some computational modeling work there, some experimental psychology just trying to understand the basics of how numbers are understood and represented. And we also do some field work, so we work with a South American indigenous community to try to understand the role of formal schooling in number acquisition. So, exciting kind of combination of each of those kind of topics.

Florian: Exceptionally exciting and fascinating. Absolutely. So I learned about your work when you published a paper recently on large language models, kind of broad as possible umbrella there, and it had quite an impact. And in the paper, you say that LLMs are not just impressive, they’re philosophically important. Now, can you expand a bit on why you think so?

Steven: One of the things that I really like about the field of language and language acquisition in particular, is that it really touches on these deep questions about human nature, right? So what kinds of things need to be innate for humans, have to be just built in genetically to enable us to not only learn language, but learn all of the other conceptual domains and cognitive abilities that we know. And those kinds of questions about human nature, I think, are well, I should say they’re also deeply related to questions about human uniqueness, right? So as far as we can tell, no other species can learn human language or anything kind of remotely as complicated as human language. And so if we’re really interested in understanding what makes people the way that they are, then language is a really good place to look for that. And of course, in language acquisition, there have been decades of debates and competing theories about what kinds of things need to be there. And the kind of main, I guess, thesis of that paper is that large language models have really changed the landscape for theories there. So for a long time, for example, it was thought that language learning was just impossible without there being substantial innate constraints. So people even had very kind of mathematical arguments that would try to say, you mathematically could not figure out the right grammar from the kind of input that kids get and therefore, some pieces of that grammar have to be present innately, they have to be encoded for us genetically, whatever that means. And that argument in particular, I think, has been really refuted by large language models because they show that if you give enough text, they’re able to identify a really competent grammatical system, probably more competent than our existing linguistic theories from just observing sentences and input data like that. 
So it’s probably not the case that you sort of need to mathematically be provided with pieces of that grammar. The right kind of learning system is able to discover the key pieces of grammar.


Florian: Now, if some of the listeners think this sounds familiar from a conceptual point of view, it’s because they might have come across the innateness kind of theory in their translation studies, linguistic studies, and of course, the grandfather or father of it all would be Noam Chomsky, right? And so the title of your paper was quite provocative. It was actually called “Modern language models refute Chomsky’s approach to language”, so quite the title and let me just quote from the abstract as well. You said that the rise and success of large language models, as you already pointed out, now undermines virtually every strong claim for the innateness of language that has been proposed by generative linguistics. Modern machine learning has subverted and bypassed the entire theoretical framework of Chomsky’s approach, including its core claims to particular insights, principles, structures, and processes. Now, can we just revisit, can you help us revisit Chomsky’s kind of key paradigm and how it maybe has shaped linguistics for decades? I mean, a lot of the listeners here would, if they remember anything from linguistics, it’s probably the name Chomsky, right? So can you just help us kind of understand that first, and then we go into why you think that LLMs kind of refute and undermine that theory?

Steven: There’s a couple of key interrelated ideas with Chomsky’s approach. So one of them, for example, is that in a grammatical theory, we should be finding kind of discrete systems of rules, right? So you can think about a rule that might say something like put the subjects before the verbs or put the objects after the verbs and those kind of discrete rules… In the paper, I have one or two quotes from Chomsky talking about, for example, how probability is completely useless, so there’s nothing probabilistic or stochastic or gradient in grammatical systems. And that’s another example where large language models work in a completely different way, right? So they have a continuous space of neural network weights, and they use gradient descent, so they compute derivatives with respect to their parameters in order to tune those parameters and make them do a good job of predicting data. So there’s nothing like the kind of discrete rules that Chomsky’s theories propose inside of these. And in fact, both the probability part and the gradient part are probably very important for making these models work well. So if you want to optimize a model with lots of parameters, for example, then this is the main method that people have figured out for how to do it. So I think that maybe at the most basic level, the underlying representations end up looking really, really different, right? You have a grammar which is somehow encoded into the weights of a neural network compared to a system of kind of logical rewrite rules or something. There’s other kinds of assumptions which also, I think, differ fundamentally. So, for example, since the 90s, one of the main features of Chomsky’s approach has been trying to find kind of small, minimal sets of rules and principles which can explain language and he talks about that as a kind of defining feature of his approach. 
I think it’s actually probably not that unique in the sense that scientists generally try to find simple rules and principles. But in particular, his approach to linguistics often tries to minimize the amount of, say, memorized structure, so trying to derive as much as possible from the rules. And if you think about people, so people are very good at learning words, for example, we know tens of thousands of different words, and we also know tens of thousands of different idioms, right? Idioms just have to be memorized because we know their meaning and we know their linguistic form and the meaning is not derivable from the linguistic form, right? So, like kick the bucket has a meaning which is not derivable from the words. So we’re very, very good at learning little chunks of language like that and actually, large language models are similarly also good at that. So they’re not seeking, in Chomsky’s sense, a minimal set of rules. They’re very happy to memorize data, to memorize idioms or little pieces of language. And one of the, I think, kind of remarkable findings of kind of deep learning and modern language models has been that there exist statistical models which are good at memorizing data, but also good at generalizing to new data. So for a long time, it was thought in kind of statistical approaches that if you had a model which had too many parameters, so many parameters that it could essentially memorize most of what it sees, it wouldn’t be good at generalizing meaning extending to, say, sentences or image categories or whatever outside of its training set and deep learning tools, for whatever reason, seem able to do that. And that means that you can have things which aren’t kind of explicitly seeking minimal sets of logical rules, as in Chomsky’s theory, but instead are very good at memorization and also very good at generalization. 
I think that those two pieces together are kind of most of the advance and if you think about what a linguistic theory should look like, I think that large language models come at this question from a completely different point of view and end up doing really, really well on pretty much any task we can find for them, right? So they’re good at syntactic tasks, they’re reasonably okay at some kind of basic reasoning tasks, they’re good at translation, they’re good at answering questions and writing computer code and all of that kind of stuff. So I think from that starting point, which is different than Chomsky’s, they’re able to exhibit a much wider and kind of more powerful set of language abilities.
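Editor's illustration: the gradient-descent procedure Steven describes (compute derivatives of the prediction error with respect to the parameters, then nudge the parameters to reduce that error) can be sketched on a model with just two continuous parameters. This is a minimal sketch and not how any particular language model is trained; the function name `fit_line`, the learning rate, the step count, and the toy data are all assumptions chosen for illustration.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w * x + b by plain gradient descent on mean squared error."""
    w, b = 0.0, 0.0  # continuous parameters, starting from an arbitrary point
    n = len(xs)
    for _ in range(steps):
        # Derivatives of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Move each parameter a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, for illustration only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(round(w, 3), round(b, 3))
```

Nothing in the loop is a discrete rule: the fitted behaviour lives entirely in the values of `w` and `b`. Large language models apply the same idea to billions of weights rather than two.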

Florian: I think in your paper also, you just mentioned the concept of outside of training data, right, and I think it was important for you… In your very first example in the paper, you said that you tested ChatGPT’s abilities with this “ant could sink an aircraft carrier” example to demonstrate that you have to be very careful to make sure that you test it on things that are clearly outside of any training data. Can you just tell us a bit more about how you prompted ChatGPT to get outside of that training data realm?

Steven: This has actually made kind of working with these models kind of fun, right? So they’re trained on huge amounts of text from the Internet, think billions and billions of tokens, everything in Wikipedia, for example, a lot of comment threads on different sites and whatever. So that means that if you ask them a really common kind of question, if you ask them to maybe say what it would look like to dig up an anthill, for example, there might be text on the internet that contains that information and so if they provide it back to you, then you have to worry that they’re just repeating something they’ve already seen, right, because we know they have this ability to memorize chunks of language. And this means that to really test them, you have to ask them questions which are really unlikely to have been encountered before, even on the entire Internet, and so it’s kind of fun. It takes a little bit of creativity, I think, to think of things which are outside of the box of the entire Internet. One example from the paper is that I had asked it to describe how an ant could sink an aircraft carrier, right? And it comes up with this story about one ant kind of rallying together all of the other ants and coming up with a scheme to sink an aircraft carrier. And if you look for that text, or text like it on the Internet, it really isn’t there, so what that says is that it’s able to generate coherent discourses just from a little prompt like that, far outside of its training set. The other one in the paper was asking it to explain the fundamental theorem of arithmetic in the style of Donald Trump, right, and so that’s just a fact in number theory that you can factor a number down into its prime factors. And they’re very good at imitating style and so it gave this speech that was very reminiscent of Trump saying things like, believe me, I know a lot about prime numbers and that kind of stuff in there. 
So that’s also certain to be outside of its training set, and yet it’s able to put those pieces together in a coherent and sensible way.
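Editor's illustration: for readers unfamiliar with the theorem mentioned above, it states that every integer greater than 1 can be written as a product of primes, and that this factorization is unique up to the order of the factors. A minimal sketch using trial division follows; the function name `prime_factors` is hypothetical, and this simple method is only practical for small numbers.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) in ascending order."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors = []
    d = 2
    while d * d <= n:
        # Divide out each factor completely before trying the next candidate,
        # so only primes are ever appended.
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever remains has no divisor up to its square root, so it is prime.
        factors.append(n)
    return factors


print(prime_factors(360))  # 360 = 2 * 2 * 2 * 3 * 3 * 5
```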

Florian: Then maybe some critics would still argue that it’s just simply kind of an endless sequence of predicting the next token, right, and that’s how it comes up… It just kind of takes the most likely next solution and then sometimes, I was told, it kind of has to deviate a little bit from this to make it creative, so you would think that that argument wouldn’t count. It’s not just predicting, like it’s not just an endless sequence of predicting the next one, but there is something else going on there.

Steven: It’s certainly true that for many or most of these models, their training consists of being able to predict the next token in language, right? So they see some string and then they’re asked what the next word is going to be in that string and when you ask them to answer a question like that, what they’re doing is predicting the language that would follow that question. So explain the fundamental theorem of arithmetic in the style of Donald Trump. They’re taking that text and then predicting word by word what the next likely word would be and that happens to be a description of the theorem in the style of Donald Trump. So I think it’s true that they’re working like that. I think where the interesting debate is, is what exactly does that mean, right? So how I think about it is that if you were doing a really good job of predicting upcoming linguistic material, what word was going to be said next? You’d actually have to have discovered quite a bit about the world and about language, right, the grammar. So if you think about these models as having lots of parameters and kind of configuring themselves in a way in order to predict language well, probably what they’re doing is actually configuring themselves to represent some facts about the world and some facts about the dynamics of language, right? So, for example, if you gave it a prompt that said something like, you walk into a fancy Italian restaurant, what happens next, right? Well, it will just predict the next word. It’ll probably give you a plausible description of that scene, of what the next events are going to be. But if it knows that you’re going to be handed a menu and shown a table, it only knows that because it has internally represented the relationships between words like restaurant and words like menu and table and the sequential progression of events like that. 
So it’s built some at least approximate little model of what’s happening in the world and that model is encoded implicitly somehow in all of these billions of neural network weights. So I think of this word prediction as just a kind of description of its training setup. But I think one thing that’s been very surprising, even to people in AI and certainly in linguistics and cognitive science, is that from that kind of prediction, you’re able to discover lots about language and probably also lots about the world.
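Editor's illustration: the "predict the next word, then repeat" loop described above can be sketched with a deliberately tiny stand-in for a language model. Real LLMs use neural networks conditioned on long contexts; this sketch instead uses a bigram count table that looks only at the previous word. The function names, the toy corpus, and the greedy choice of the most frequent follower are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of `token`, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def generate(counts, start, max_tokens=5):
    """Repeatedly append the predicted next word, one token at a time."""
    out = [start]
    while len(out) < max_tokens:
        nxt = predict_next(counts, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return out


# Hypothetical toy corpus echoing the restaurant example.
corpus = (
    "the waiter hands you a menu . "
    "the waiter shows you a table . "
    "the waiter hands you a menu"
).split()
counts = train_bigram(corpus)
print(" ".join(generate(counts, "waiter")))
```

Even this crude table ends up encoding a small regularity about restaurants (waiters hand you menus) purely because that regularity helps predict the next word, which is the point Steven makes at a vastly larger scale.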


Florian: Maybe it wasn’t a controversy, but you certainly, probably got a bit of pushback on Twitter and in some of these debates that I saw on YouTube, like, what are the different camps? And do you think it’s just an adjustment and some people adjust faster than others and others just cling on to like the past four decades of obviously, literature and research? Like, where do you see this going within the linguistics community, for example?

Steven: My own view is that it really changes pretty much everything in linguistics, right? So the reason for that is that there just haven’t been models that work this well in anything, right? So if you look at, for example, a generative syntax textbook, it’ll have hundreds and hundreds of pages about what the likely structures are underlying language and little arguments about why it’s this structure and not this other structure. But the problem is that many of the approaches there start from the same set of basic assumptions, right? So they start by trying to find some small set of discrete rules. They don’t start from kind of gradient continuous probabilities and kind of rich ability to memorize things. And so what that means is that most of those theories, I think, are probably not going to last very long, right, because they’re just from the wrong starting points and they’re from starting points that people had decades to work on and that those decades of effort didn’t produce anything close to the abilities that these models have. So I think of them as really changing the starting points and the core underlying assumptions of how we think about what it means to represent a grammar or what it means to represent linguistic knowledge and from my point of view, that’s great, right? That’s a real advance in our understanding and our way of thinking about things. And like, I think pretty much everything in science, those kinds of advances really necessitate moving past the prior theories, right, moving to the next theory that works better.

Florian: It must have been very hard to falsify any of these theories in the past. I mean, how can you disprove that we don’t have discrete rules in our head as humans, right? And now we have a model that does it clearly not in the way that a human would do it, but can we maybe just dwell on the question, like, how or in what ways does kind of an LLM differ from a human in generating information in natural language? Clearly, there is something, I mean, there’s something very, very generative going on with these models, right? And do you think, is it too early to even fundamentally assess the difference or how… Well, so I’m rambling a bit, but like, going back to how humans generate, can we get insights about how humans generate language and information from what we’re seeing now playing out with the LLMs? I guess that’s mildly more coherently put.

Steven: I think both of those questions are really good. So for the first one on the differences, one difference that a number of people have pointed to is that we seem to have a variety of different modes of reasoning, many of which are probably not accessible to these models. So, for instance, you can picture a 3D scene and reason about it, right? You could picture the Italian restaurant I brought up and come up with a hypothetical guess of where stuff would be laid out and whether there’s a candle on top of the table or underneath the table and kind of think through things in that kind of geometrical way. You can also reason about different entities, right? Well, I guess both reasoning and planning, so if I asked you how to get to the airport or what the sequence of steps is that you would need for doing that, probably your knowledge of that is quite rich and richer than even a large language model would have. So I think one of the main differences is that they’re mostly just doing language, but there’s a lot in that word, mostly because they are able to solve certain reasoning problems. They can maybe play a chess opening or something just based on kind of the statistical patterns of data that they’ve seen. So I think this is one of the other things that’s been surprising, is that just training on language seems to give them some of the information about the world and kind of how the world is structured, and they’re able to internalize that. So they could, for instance, tell you how to get to the airport, but if you ask them more complicated questions, let’s say that on the way to the airport, your taxi gets a flat tire or something, what should you do? If you keep making things more and more complicated, you’ll probably run into some wall with their reasoning ability and their ability to keep track of all of the moving pieces. 
So this is something, both of those aspects, I think, are things that people are currently working on, so trying to integrate these kinds of linguistic abilities with other kinds of kind of cognitive reasoning modules that people seem to have. There is, I should say, even a debate in the literature about whether that’s even necessary, right? So there’s some people who think or maybe thought that you wouldn’t need anything other than just this kind of general class of model, right? So if we just gave them enough data or the right kind of data maybe they could learn to develop reasoning or learn to reason and think about objects and physical scenes and these kinds of things. And there’s lots of evidence that they have some ability there and it’s, I think, kind of unclear still how much ability they have or how exactly it compares to people. Probably it’s kind of a coarse fuzzy model of the world that they’re developing and not as kind of sophisticated as the one that people get.

Florian: There’s one thing also, the data size issue, right? These models have been trained with like I think we’re in the 500 billion kind of reach now and some argued that there’s even bigger ones. But there’s also something I think I came across in your writing, a BabyLM, so basically training the models from scratch on human-sized amounts of linguistic data. So what is human-sized amounts of linguistic data and what’s the challenge in training the BabyLM?

Steven: This actually relates to one key weakness of the current models that I haven’t touched on here, which is that they’re trained on much more data than people get. So this has been one of the kind of primary responses that people have said about the article, right, is like, okay, these aren’t actually relevant to human language learning because maybe they get 100 or 1000 times as much data as a human kid gets. And there’s two, I think, main responses to that, so one is that these models are very, very new and we don’t actually know how much data is necessary. So it could be that there’s a nearby architecture, so something kind of like the existing models, but maybe with a little twist or a certain kind of recurrence or something that allows them to learn what they need from much, much less data. The second thing to say is that probably a lot of that data is just not going into learning the syntax, the grammar of the language. Probably a lot of it is going into learning either the semantics of the language or these other kind of semantic aspects of learning about the world and kind of structures and things in the world. And so if that’s true, it could be the case that learning grammar and language is not so hard, but learning kind of semantics and meaning and about the world takes lots and lots of data. And of course, human learners are in a very different situation than large language models in that they’re getting independent data about the world, data that’s independent of language from interacting with the world or interacting with other people in the world. So that’s kind of how the data issue, I think, is relevant to these questions about whether the models have anything to say about what human learning is actually like. The BabyLM Challenge is this really exciting project where people are kind of competing to see whether you can train a model like this on human-sized amounts of data. 
I believe human-sized is something like 10 to 100 million tokens and if that’s roughly the amount of data that you get in childhood, then it’s really important for us to know whether it’s possible to take that amount of data and learn syntax, for example, or to develop other models which are able to learn syntax from that sized data. Part of why BabyLM is a thing is that these kind of current models like ChatGPT, for example, are developed by AI companies who don’t really care about human learning and language acquisition, right? They’re just trying to build a useful product, and so they don’t really care about the scaling with respect to the amount of data. If you’re interested in these things as language acquisition theories, then you really care about the amount of data because if it takes a trillion tokens or whatever, then that’s not going to be plausible. But many people are optimistic, I think, that it can be done with much less.

Florian: Couldn’t you argue that maybe you said between 10 and 100 million tokens, right, but for a human, so that’s a lot less than these models, right? But there’s all this other kind of multimodal information that a child will get. I think you even mentioned it, right? I mean, so you ask your mother and then she actually reacts in a certain way, and you’re looking at it sort of like visual input that is not necessarily linguistic. So is that taken into account here at all, or we’re just kind of separating the linguistic component out?

Steven: In BabyLM, I believe you’re able to include multimodal information if you want. So you could have a learning model, for example, that watched 1000 hours of video and tried to learn about events and event structure in the world like that. And I think in general, that’s very hard, right, because you can imagine trying to make the statistical learning model, which could take 1000 hours of video and learn that there are objects, or that objects sit in certain spatial relationships or something like… Whatever kids know about objects and the world I think it’s a very hard task to extract that from video, but I think that’s what most people think is going on with child acquisition, right? So kids don’t require 100 billion tokens because they’re in a situation where they can learn a word from a single instance, right? You hear the word dacks when there’s a kangaroo around and you figure out that dacks means kangaroo and that kind of learning mechanism can be very fast both on the syntax and the semantics side and that kind of experience is not what these language models have. And so some people take that and say, well, that means that they’re irrelevant to language acquisition. Other people, like me, take it and think like, okay, it means that probably we can make versions of these models which work with much less data and which are therefore much more directly relevant to kind of real world learning.


Florian: I’ll follow the BabyLM Challenge. I understand now it’s very good that you put this in context, that this is actually very important for your part of the linguistics field because the amount of data you put in is so crucial to kind of the overarching kind of conclusions you’re taking from it. Let’s bring it back to Chomsky. I find it interesting that he’s taking somewhat of a Luddite position on this because like in a New York Times article, and I’m going to have to quote a bit here, he’s saying that while AI tech, like ChatGPT may have some practical uses, he actually mentions language translation and information retrieval, they’re not capable of true understanding or consciousness. And then he says it lacks the embodied intelligence and perceptual experience that humans possess. He says that it’s tech for profit and efficiency and may ultimately lead to greater inequality, job displacement. And finally, he kind of urges a more critical and thoughtful approach to AI development. Now, from an industry perspective, I wonder how relevant are any of these questions when LLMs are being shipped and it’s basically a trillion-dollar market opportunity? I’m not sure. You’re kind of sitting in the middle like you’re criticizing him from an academic point of view, but also from an industry point of view I wonder what’s the point of making these points in a sense, if it’s happening anyway?

Steven: I probably agree with him a lot on the kind of ethics issues. So for example, when you train on text from the internet, there’s a lot of horrible things on the internet. And these models, even ChatGPT, at least the early versions, you could extract horrible things from them. And from an industry perspective that’s very bad because if you want to rely on that kind of technology, you need to be able to trust that it’s not going to say horrible things or exhibit illegal biases, for example, or immoral biases. So I think that there’s lots of questions and concerns there. There’s also lots of questions and concerns about, for example, misinformation, so people have pointed out that when you have models like this, they’re kind of the perfect tool for spreading misinformation or trying to influence elections or other kinds of things, which I think you could see at Facebook, for example. Everybody thinking it’s a friendly social media site and then political groups being able to hijack the advertising to really push around elections. And I think that there’s lots of kind of unintended and likely still unanticipated consequences like those. So I think there’s lots to be worried about with these models and part of what makes it complicated is, like you said, everybody is able to make them. There was an article in the New York Times last week or the week before about Nano GPT, which is basically a GPT model you can train in an hour on your desktop, right? And when the technology is that accessible, it’s very hard to think about regulating it or controlling it in any way. But I think that those are important concerns and they’re certainly important concerns if people want to use this in an applied setting. So I think that I agree with him on all of those kinds of concerns. Also, I guess concerns about taking people’s jobs, right? 
So there’s lots of jobs probably that can be replaced by this, and the people who are making these technologies, I don’t think, are thinking through the societal consequences of what that will be. All of that said, on the language science side, I think that there’s really interesting questions about understanding, for example, where I actually think that these models probably have some form of understanding and, in fact, their form of understanding is probably a lot like our form of understanding. So in collaboration with Felix Hill, who’s a researcher at DeepMind, we wrote this paper maybe about a year ago on understanding and concepts in these models. So one intuition a lot of people have is that to really understand something, you have to know about the physical referent of the thing, right? What’s our example? We had an example of a postage stamp. Okay, if you really want to understand a postage stamp, you have to be able to pick out the physical thing and know physically what it looks like. Those are kind of the defining features of the term. And what we argued actually is that there’s this long-standing approach in philosophy of mind and philosophy of language which says it’s not really the physical things that define our concepts, it’s really the relationships between concepts. So for example, what makes something a postage stamp is something like: it’s a thing that you pay for and put on a letter so that the government will deliver the letter to some address, okay? So almost sort of definitional, but certainly some relationship among those pieces. And if you think about that definition, you could probably imagine types of postage stamps which don’t physically exist, but which everybody would call a postage stamp, right?
So for example, a postage stamp that was made out of glass. You could imagine some country somewhere decided that they were going to issue glass postage stamps, like little microscope slide covers or something, little thin pieces of glass you could attach to a letter. And that kind of example is one where it can match all of the definitional, relational types of properties. It’s something you pay for and you attach it to a letter, but it’s a physical instantiation of a postage stamp you’ve maybe never even thought about before. So if you can think about it and agree that it would be a postage stamp, even though it’s a physical thing you’ve never seen before, that tells us that the physical part is not the part that defines the concept. It’s much more likely that what defines the concept is these relationships to other concepts. And those relationships, we argue, are exactly what large language models have. So they know what you should do with a postage stamp, and they could maybe even reason a little bit about situations in which you had to pay for a postage stamp, or what somebody would be likely to do with one, or something like that. So even though they don’t know anything about the physical grounding of those concepts, they still know the relationships, and we argue that that’s a really compelling picture of how human concepts and human meanings work. This is, I guess, getting back to your very first question about these things being philosophically profound. I think that they really are, because they force us to think about these kinds of questions, like what do we really mean by meaning, or what do we really mean by grammar? And they show you a system which seems to have lots of those aspects, right? It knows a lot about grammar and it knows a lot about meaning in this sense of relationships, and maybe that’s most of what meaning is to us.

Florian: Do you see this as a step towards AGI, artificial general intelligence? Are we there already? Are we knocking at the door? Or are we now just really starting to understand how powerful computers have become, because they now speak our language? I mean, until now you had to code, and I can’t code, but now I can actually prompt it in my own language. So where do you stand on this AGI spectrum debate?

Steven: I think it’s a very exciting time. I think that large language models in particular, but deep learning in general, are just a huge advance over the state of the field, say, 10 or 15 years ago, where people knew some of these pieces and were hopeful that the pieces could be put together in a useful way. But now there are systems which you can demonstrate work well, and they work better than pretty much any other system on pretty much any language task you want to find, so translation or parsing or question answering or converting between language and code or any of those kinds of things. There just aren’t other competing models which can do this. So from my point of view, the advances that have been made over the last, say, 10 years in language models are a huge step towards artificial general intelligence, but they’re not quite there yet. And they’re not quite there yet probably because of this issue that these models engage in one very specific mode of reasoning which is tied to language, and humans seem able to reason and think about the world in a variety of different ways. So I don’t think it’s going to be that hard to incorporate reasoning into these models, or more sophisticated kinds of representations of the type that people have. And so that makes me think that AGI is probably very close, due in large part to these recent advances, but I don’t know. The other thing I’ve learned from these models is that I have no ability to predict anything about what will happen in the future, because if you had asked people 15 years ago, they would have said that this approach couldn’t possibly do language, or couldn’t possibly capture meanings in language or grammar, and that’s a large part of what these models have shown to be wrong. So I guess I’m trying not to make predictions, but I’m very enthusiastic about it.