The article puts scare quotes around "understand" etc. to try to head off critiques around the lack of precision or scientific language, but I think this is a really good example of where casual use of these terms can get pretty misleading.
Because code LLMs have been trained on the syntactic form of the program and not its execution, it's not correct — even if the correlation between variable annotations and requested completions was perfect (which it's not) — to say that the model "understands nullability", because nullability means that under execution the variable in question can become null, which is not a state that it's possible for a model trained only on a million programs' syntax to "understand". You could get the same result if e.g. "Optional" means that the variable becomes poisonous and checking "> 0" is eating it, and "!= None" is an antidote. Human programmers can understand nullability because they've hopefully run programs and understand the semantics of making something null.
The paper could use precise, scientific language (e.g. "the presence of nullable annotation tokens correlates to activation of vectors corresponding to, and emission of, null-check tokens with high precision and accuracy") which would help us understand what we can rely on the LLM to do and what we can't. But it seems like there is some subconscious incentive to muddy how people see these models in the hopes that we start ascribing things to them that they aren't capable of.
I was going to say "so you believe the LLMs don't have the capacity to understand" but then I realized that the precise language would be something like "the presence of photons in this human's retinas in patterns encoding statements about LLMs having understanding correlates to the activation of neuron signaling chains corresponding to, and emission of, muscle activations engaging keyboard switches, which produce patterns of 'no they don't' with high frequency."
The critiques of mental state applied to LLMs are increasingly applicable to us biologicals, and that's the philosophical abyss we're staring down.
Countering the argument that LLMs are just glorified probability machines and do not understand or think with "how do you know humans are not the same" has been the biggest achievement of AI hypemen (and yes, it's mostly men).
Of course, now you can say "how do you know that our brains are not just efficient computers that run LLMs", but I feel like the onus of proof lies on the makers of this claim, not on the other side.
It is very likely that human intelligence is not just autocomplete on crack, given all we know about neuroscience so far.
No it's not. He gave you modal conditions on "understanding"; he said: predicting the syntax of valid programs and their operational semantics, i.e., the behaviour of the computer as it runs.
I would go much further than this, but this is a de minimis criterion that the LLM already fails.
What zealots eventually discover is that they can hold their "fanatical proposition" fixed in the face of all opposition to the contrary, by tearing down the whole edifice of science, knowledge, and reality itself.
If you wish to assert, against any reasonable thought, that the sky is a pink dome you can do so -- by claiming first that our eyes are broken, and then, eventually, that we live in some paranoid "philosophical abyss" carefully constructed to permit your paranoia.
This absurdity is exhausting, and I wish one day to find fanatics who'd realise it quickly and abate it -- but alas, I never have.
If you find yourself hollowing out the meaning of words to the point of making no distinctions, denying reality to reality itself, and otherwise arriving at a "philosophical abyss", be aware that it is your cherished propositions which are the madness and nothing else.
Here: no, the LLM does not understand. Yes, we do. It is your job to begin from reasonable premises and abduce reasonable theories. If you do not, you will not.
This only applies to people who understand how computers and computer programs work, because someone who doesn't externalize their thinking process would never ascribe human elements of consciousness to inanimate materials.
Certainly many ancient people worshiped celestial objects or crafted idols by their own hands and ascribed to them powers greater than themselves. That doesn't really help in the long run compared to taking personal responsibility for one's own actions and motives, the best interests of their tribe or community, and taking initiative to understand the underlying cause of mysterious phenomena.
We don't really have a clue what they are and aren't capable of. Prior to the LLM-boom, many people – and I include myself in this – thought it'd be impossible to get to the level of capability we have now purely from statistical methods and here we are. If you have a strong theory that proves some bounds on LLM-capability, then please put it forward. In the absence of that, your sceptical attitude is just as sus as the article's.
I majored in CogSci at UCSD in the 90's. I've been interested and active in the machine learning world for decades. The LLM boom took me completely and utterly by surprise, continues to do so, and frankly I am most mystified by the folks who downplay it. These giant matrixes are already so far beyond what we thought was (relatively) easily achievable that even if progress stopped tomorrow, we'd have years of work to put in to understand how we got here. Doesn't mean we've hit AGI, but what we already have is truly remarkable.
> Because code LLMs have been trained on the syntactic form of the program and not its execution
One of the very first tests I did of ChatGPT way back when it was new was to give it a relatively complex string manipulation function from our code base, strip all identifying information from the code (variable names, the function name itself, etc.), and then provide it with inputs and ask it for the outputs. I was surprised that it could correctly generate the output from the input.
So it does have some idea of what the code actually does, not just its syntax.
> Because code LLMs have been trained on the syntactic form of the program and not its execution
What makes you think this? It has been trained on plenty of logs and traces, discussion of the behavior of various code, REPL sessions, etc. Code LLMs are trained on all human language and wide swaths of whatever machine-generated text is available, they are not restricted to just code.
How do you know that these models haven't been trained by running programs?
At least, it's likely that they've been trained on undergrad textbooks that explain program behaviors and contain exercises.
For all you know, AI labs are doing E2E RL training with running code in the loop to advance the model's capability to act as an agent (for Cursor et al.).
LLMs "understand" nullability to the extent that texts they have been trained on contain examples of nullability being used in code, together with remarks about it in natural language. When the right tokens occur in your query, other tokens get filled in from that data in a clever way. That's all there is to it.
The LLM will not understand, and is incapable of developing an understanding, of a concept not present in its training data.
If you try to teach it the basics of the misunderstood concept in your chat, it will reflect back a verbal acknowledgement, restated in different words, with some smoothly worded embellishments which look like the external trappings of understanding. It's only a mirage, though.
The LLM will code anything, no matter how novel, if you give it detailed enough instructions and clarifications. That's just a language translation task from pseudo-code to code. Being a language model, it's designed for that.
The LLM is like a bar waiter who has picked up on economics and politics talk, and is able to interject with something clever-sounding, to the surprise of the patrons. Gee, how does he or she understand the workings of the International Monetary Fund, and what the hell are they doing working in this bar?
There seems to be a typo in OP's "Visualizing Our Results" - but things make perfect sense if red is non-nullable, green is nullable.
I'd be really curious to see where the "attention" heads of the LLM look when evaluating the nullability of any given token. Does it just trust the Optional[int] return type signature of the function, or does it also skim through the function contents to understand whether that's correct?
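To make the setup concrete, here is a hedged Python sketch of the kind of code in question (the function and names are made up, not taken from the article): the return annotation alone advertises nullability, while the body is what actually makes it true, so a probe could in principle key on either signal.

    # Hypothetical example: the annotation says the result may be None,
    # and the body confirms it.
    from typing import Optional

    def find_index(items: list[int], target: int) -> Optional[int]:
        for i, item in enumerate(items):
            if item == target:
                return i
        return None  # the nullable case the annotation promises

    idx = find_index([1, 2, 3], 5)
    if idx is not None:  # the None check a model tracking nullability should emit
        print(idx + 1)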
It's fascinating to me to think that the senior developer skillset of being able to skim through complicated code, mentally make note of different tokens of interest where assumptions may need to be double-checked, and unravel that cascade of assumptions to track down a bug, is something that LLMs already excel at.
Sure, nullability is an example where static type checkers do well, and it makes the article a bit silly on its own... but there are all sorts of assumptions that aren't captured well by type systems. There's been a ton of focus on LLMs for code generation; I think that LLMs for debugging makes for a fascinating frontier.
One thing that is exciting in the text is the attempt to move away from asking whether an LLM 'understands' (which I would argue is an ill-posed question) and instead rephrase it in terms of something that can actually be measured.
It would be good to list a few possible ways of interpreting 'understanding of code'. It could possibly include:
1) Type inference for the result
2) nullability
3) runtime asymptotics
4) What the code does
5) predicting a bunch of language tokens from the compressed database of knowledge encoded as weights, calculated out of numerous examples that exploit nullability in code and talk about it in accompanying text.
Is there any way you can tell whether a human understands something other than by asking them a question and judging their answer?
Nobody interrogates each other's internal states when judging whether someone understands a topic. All we can judge it based on are the words they produce or the actions they take in response to a situation.
The way that systems or people arrive at a response is sort of an implementation detail that isn't that important when judging whether a system does or doesn't understand something. Some people understand a topic on an intuitive, almost unthinking level, and other people need to carefully reason about it, but they both demonstrate understanding by how they respond to questions about it in the exact same way.
No, most people absolutely use non-linguistic, involuntary cues when judging the responses of other people.
To not do that is commonly associated with things like being on the spectrum or cognitive deficiencies.
On a message board? Do you have theories about whether people on this thread understand or don't understand what they're talking about?
You are saying no while presenting nothing to contradict what GP said.
Judging someone's external "involuntary cues" is not interrogating their internal state. It is, as you said, judging their response (a synonym for "answer") - and that judgment is also highly imperfect.
(It's worth noting that focusing so much on someone's body language and tone that you ignore the actual words they said is a communication issue associated with not being on the spectrum, or being too allistic.)
I found this overly handwavy, but I discovered that there is a non-"gentle" version of this page which is more explicit:
https://dmodel.ai/nullability/
The visualisation of how the model sees nullability was fascinating.
I'm curious if this probing of nullability could be composed with other LLM/ML-based python-typing tools to improve their accuracy.
Maybe even focusing on interfaces such as nullability rather than precise types would work better with a duck-typed language like Python than inferring types directly (i.e. we don't really care that a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric).
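For what it's worth, a rough sketch of what "targeting the interface" could look like in Python, using typing.Protocol; the names here are illustrative, not something proposed by the post:

    # Illustrative sketch: expressing "supports __add__/__sub__" as an interface
    # (a Protocol) instead of requiring a concrete type like int.
    from typing import Protocol

    class SupportsArithmetic(Protocol):
        def __add__(self, other): ...
        def __sub__(self, other): ...

    def total_change(readings: list[SupportsArithmetic]) -> SupportsArithmetic:
        # accepts ints, floats, Decimals, numpy scalars... anything "numeric enough"
        return readings[-1] - readings[0]

    print(total_change([1.5, 2.0, 4.5]))  # 3.0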
Why not just use a language with checked nullability? What's the point of an LLM using a duck typing language anyway?
This post actually mostly uses the subset of Python where nullability is checked. The point is not to introduce new LLM capabilities, but to understand more about how existing LLMs are reasoning about code.
> we don't really care that a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric
my brother in christ, you invented Typescript.
(I agree on the visualization, it's very cool!)
I am more than aware of TypeScript; you seem to have misunderstood my point. I was not describing a particular type system (of which there have been many of this ilk) but rather conjecturing that targeting interfaces specifically might make LLM-based code generation/type inference more effective.
Yeah, I read that comment wrong. I didn't mean to come off like that. Sorry.
Once LLMs fully understand nullability, they will cease to use that.
Tony Hoare called it "a billion-dollar mistake" (https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retra...); Rust made core design choices precisely to avoid this mistake.
In practical AI-assisted coding in TypeScript I have found that it is good to add in Cursor Rules to avoid anything nullable, unless it is a well-designed choice. In my experience, it makes code much better.
I don't get the problem with null values as long as you can statically reason about them, which wasn't even the case in Java, where you always had to do runtime null-guards before access.
But in Typescript, who cares? You’d be forced to handle null the same way you’d be forced to handle Maybe<T> = None | Just<T> except with extra, unidiomatic ceremony in the latter case.
What do you mean by unidiomatic? If a language has that as a core concept, then it's idiomatic by definition.
Sounds like the process used to update/jailbreak LLMs so that they don't deny requests and always answer. There is also a direction for refusal. (Article about it: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...)
Would be fun if they also „cancelled the nullability direction“.. the llms probably would start hallucinating new explanations for what is happening in the code.
> Interestingly, for models up to 1 billion parameters, the loss actually starts to increase again after reaching a minimum. This might be because as training continues, the model develops more complex, non-linear representations that our simple linear probe can’t capture as well. Or it might be that the model starts to overfit on the training data and loses its more general concept of nullability.
Double descent?
I'm curious what happens if you run the LLM with variable names that occur often with nullable variables, but then use them with code that has a non-nullable variable.
Dear future authors: please run multiple iterations and report the probability.
From: ‘Keep training it, though, and eventually it will learn to insert the None test’
To: ‘Keep training it, though, and eventually the probability of inserting the None test goes up to xx%’
The former is just horse poop; we all know LLMs generate big variance in output.
If you're interested in a more scientific treatment of the topic, the post links to a technical report which reports the numbers in detail. This post is instead an attempt to explain the topics to a more general audience, so digging into the weeds isn't very useful.
LLMs understand nothing.
They are not reasoning.
"Validate a phone number."
The code is entirely wrong. It validates something that's close to a NANP number but isn't actually a NANP number. In particular, the area code cannot start with 0, nor can the central office code. There are several numbers, like 911, which have special meaning and cannot appear in either position.
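For illustration only, a hedged sketch of those stricter rules (my own approximation of a NANP subset, not a complete or authoritative validator):

    import re

    # Area code and central office code must start with 2-9, and N11 service
    # codes (211, 311, ..., 911) are excluded from both positions.
    NANP_RE = re.compile(r"^\(?([2-9][0-9]{2})\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$")

    def looks_like_nanp(number: str) -> bool:
        m = NANP_RE.match(number.strip())
        if not m:
            return False
        area, office = m.group(1), m.group(2)
        return not (area.endswith("11") or office.endswith("11"))

    assert looks_like_nanp("(415) 555-0123")
    assert not looks_like_nanp("911-555-0123")  # N11 code in the area position
    assert not looks_like_nanp("415-011-0123")  # office code can't start with 0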
You'd get better results if you went to Stack Overflow and stole the correct answer yourself. Would probably be faster too.
This is why "non technical code writing" is a terrible idea. The underlying concept is explicitly technical. What are we even doing?
This is really interesting! Intuitively it's hard to grasp that you can just subtract two average states and get a direction describing the model's perception of nullability.
The original word2vec example might be easier to understand: vector("king") - vector("man") + vector("woman") lands close to vector("queen").
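A minimal numpy sketch of the mean-difference idea described above (shapes, data, and names are assumed for illustration; this is not the article's code):

    import numpy as np

    rng = np.random.default_rng(0)
    # Pretend hidden states collected at variable tokens, 768-dim each.
    nullable_states = rng.normal(loc=0.3, size=(100, 768))       # variables that may be None
    non_nullable_states = rng.normal(loc=-0.3, size=(100, 768))  # variables that never are

    # "Nullability direction": difference of the two class means, normalized.
    direction = nullable_states.mean(axis=0) - non_nullable_states.mean(axis=0)
    direction /= np.linalg.norm(direction)

    def nullability_score(hidden_state: np.ndarray) -> float:
        # Larger projection onto the direction means "represented as more nullable".
        return float(hidden_state @ direction)

    print(nullability_score(nullable_states[0]))      # tends to be positive
    print(nullability_score(non_nullable_states[0]))  # tends to be negative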
As every fifth thread becomes some discussion of LLM capabilities, I think we need to shift the way we talk about this to be less like how we talk about software and more like how we talk about people.
"LLM" is a valid category of thing in the world, but it's not a thing like Microsoft Outlook that has well-defined capabilities and limitations. It's frustrating reading these discussions that constantly devolve into one person saying they tried something that either worked or didn't, then 40 replies from other people saying they got the opposite result, possibly with a different model, different version, slight prompt altering, whatever it is.
LLMs possibly have the capability to understand nullability, but that doesn't mean every instance of every model will consistently understand that or anything else. This is the same way humans operate. Humans can run a 4-minute mile. Humans can run a 10-second 100 meter dash. Humans can develop and prove novel math theorems. But not all humans, not all the time, performance depends upon conditions, timing, luck, and there has probably never been a single human who can do all three. It takes practice in one specific discipline to get really good at that, and this practice competes with or even limits other abilities. For LLMs, this manifests in differences with the way they get fine-tuned and respond to specific prompt sequences that should all be different ways of expressing the same command or query but nonetheless produce different results. This is very different from the way we are used to machines and software behaving.
Yeah the link title is overclaiming a bit, the actual post title doesn't make such a general claim, and the post itself examines several specific models and compares their understanding.
Encouraging the continued anthropomorphization of these models is a bad idea, especially in the context of discussing their capabilities.
very cool.
This is like claiming a photoresistor-controlled night light "understands when it is dark" or that a bimetallic strip thermostat "understands temperature". You can say those words, and it's syntactically correct but entirely incorrect semantically.
The post includes this caveat. Depending on your philosophical position about sentience you might say that LLMs can't possibly "understand" anything, and the post isn't trying to have that argument. But to the extent that an LLM can "understand" anything, you can study its understanding of nullability.
People don’t use “understand” for machines in science because people may or may not believe in the sentience of machines. That would be a weird catering to panpsychism.
Where is the boundary where this becomes semantically correct? It's easy for these kinds of discussions to go in circles, because nothing is well defined.
Hard to define something that science has yet to formally outline, and is largely still in the realm of religion.
That depends entirely on whether you believe understanding requires consciousness.
I believe that the type of understanding demonstrated here doesn't. Consciousness only comes into play when we become aware that such understanding has taken place, not in the process itself.
Shameless plug of a personal blog post, but relevant. Still not fully edited, so the writing is a bit scattered, but the crux is that we now have the framework for talking about consciousness intelligently. It's not as mysterious as in the past, considering advances in non-equilibrium thermodynamics and the Free Energy Principle in particular.
https://stefanlavelle.substack.com/p/i-am-therefore-i-feel
You declare this very plainly without evidence or argument, but this is an age-old controversial issue. It’s not self-evident to everyone, including philosophers.
It's not age-old nor is it controversial. LLMs aren't intelligent by any stretch of the imagination. Each word/token is chosen as that which is statistically most likely to follow the previous. There is no capability for understanding in the design of an LLM. It's not a matter of opinion; this just isn't how an LLM works.
Any comparison to the human brain is missing the point that an LLM only simulates one small part, and that's notably not the frontal lobe. That's required for intelligence, reasoning, self-awareness, etc.
So, no, it's not a question of philosophy. For an AI to enter that realm, it would need to be more than just an LLM with some bells and whistles; an LLM plus something else, perhaps, something fundamentally different which does not yet currently exist.
> Each word/token is chosen as that which is statistically most likely to follow the previous.
The best way to predict the weather is to have a model which approximates the weather. The best way to predict the results of a physics simulation is to have a model which approximates the physical bodies in question. The best way to predict what word a human is going to write next is to have a model that approximates human thought.
LLMs don't approximate human thought, though. They approximate language. That's it.
Please, I'm begging you, go read some papers and watch some videos about machine learning and how LLMs actually work. It is not "thinking."
I fully realize neural networks can approximate human thought -- but we are not there yet, and when we do get there, it will be something that is not an LLM, because an LLM is not capable of that -- it's not designed to be.
> LLMs don't approximate human thought, though. ...Please, I'm begging you, go read some papers and watch some videos about machine learning and how LLMs actually work.
I know how LLMs work; so let me beg you in return, listen to me for a second.
You have a theoretical-only argument: LLMs do text prediction, and therefore it is not possible for them to actually think. And since it's not possible for them to actually think, you don't need to consider any other evidence.
I'm telling you, there's a flaw in your argument: In actuality, the best way to do text prediction is to think. An LLM that could actually think would be able to do text prediction better than an LLM that can't actually think; and the better an LLM is able to approximate human thought, the better its predictions will be. The fact that they're predicting text in no way proves that there's no thinking going on.
Now, that doesn't prove that LLMs actually are thinking; but it does mean that they might be thinking. And so you should think about how you would know if they're actually thinking or not.
> it will be something that is not an LLM
I think it will be very similar in architecture.
Artificial neural networks already are approximating how neurons in a brain work, it's just at a scale that's several orders of magnitude smaller.
Our limiting factor for reaching brain-like intelligence via ANN is probably more of a hardware limitation. We would need over 100 TB to store the weights for the neurons, not to mention the ridiculous amount of compute to run it.
Isn't language expressed thought?
Language can be a (lossy) serialization of thought, yes. But language is not thought, nor inherently produced by thought. Most people agree that a process randomly producing grammatically correct sentences is not thinking.
Many people don't think we have any good evidence that our brains aren't essentially the same thing: a stochastic statistical model that produces outputs based on inputs.
Of course, you're right. Neural networks mimic exactly that after all. I'm certain we'll see an ML model developed someday that fully mimics the human brain. But my point is an LLM isn't that; it's a language model only. I know it can seem intelligent sometimes, but it's important to understand what it's actually doing and not ascribe feelings to it that don't exist in reality.
Too many people these days are forgetting this key point and putting a dangerous amount of faith in ChatGPT etc. as a result. I've seen DOCTORS using ChatGPT for diagnosis. Ignorance is scary.
If you're willing to torture the analogy you can find a way to describe literally anything as a system of outputs based on inputs. In the case of the brain-to-LLM comparison, people are inclined to do it because they're eager to anthropomorphize something that produces text they can interpret as a speaker, but it's totally incorrect to suggest that our brains are "essentially the same thing" as LLMs. The comparison is specious even on a surface level. It's like saying that birds and planes are "essentially the same thing" because flight was achieved by modeling planes after birds.
That's probably the case 99% of the time.
But that 1% is pretty important.
For example, they are dismal at math problems that aren't just slight variations of problems they've seen before.
Here's one by blackpenredpen where ChatGPT insisted its solution to a problem that could be solved by high school / talented middle school students was correct, even after attempts to convince it it was wrong. https://youtu.be/V0jhP7giYVY?si=sDE2a4w7WpNwp6zU&t=837
Rewind earlier to see the real answer
ChatGPT o1 pro mode solved it on the first try, after 8 minutes and 53 seconds of “thinking”:
https://chatgpt.com/share/67f40cd2-d088-8008-acd5-fe9a9784f3...
The problem is: how do you know that it's correct?
A human would probably say "I don't know how to solve the problem", but the free version of ChatGPT is confidently wrong.
> For example, they are dismal at math problems that aren't just slight variations of problems they've seen before.
I know plenty of teachers who would describe their students the exact same way. The difference is mostly one of magnitude (of delta in competence), not quality.
Also, I think it's important to note that by "could be solved by high school / talented middle school students" you mean "specifically designed to challenge the top ~1% of them". Because if you say "LLMs only manage to beat 99% of middle schoolers at math", the claim seems a whole lot different.
Care to share any of this good evidence?
Do biologists and neuroscientists not have any good evidence or is that just computer scientists and engineers speaking outside of their field of expertise? There's always been this danger of taking computer and brain comparisons too literally.
That argument only really applies to base models. After that we train them to give correct and helpful answers, not just answers that are statistically probable in the training data.
But even if we ignore that subtlety, it's not obvious that training a model to predict the next token doesn't lead to a world model and an ability to apply it. If you gave a human 10 physics books and told them that in a month they have a test where they have to complete sentences from the book, which strategy do you think is more successful: trying to memorize the books word by word or trying to understand the content?
The argument that understanding is just an advanced form of compression far predates LLMs. LLMs clearly lack many of the faculties humans have. Their only concept of a physical world comes from text descriptions and stories. They have a very weird form of memory, no real agency (they only act when triggered), and our attempts at replicating an internal monologue are very crude. But understanding is one thing they may well have, and if the current generation of models doesn't have it, the next generation might.
The thermostat analogy, and equivalents, are age-old.
Philosophers are often the last people to consider something to be settled. There's very little in the universe that they can all agree is true.
Or like saying the photoreceptors in your retina understand when it's dark. Or like claiming the temperature sensitive ion channels in your peripheral nervous system understand how hot it is.
This is a fallacy I've seen enough on here that I think it needs a name. Maybe the fallacy of Theoretical Reducibility (doesn't really roll off the tongue)?
When challenged, everybody becomes an eliminative materialist even if it's inconsistent with their other views. It's very weird.
Describing the mechanics of nervous impulses != describing consciousness.
Which is the point, since describing the mechanics of LLM architectures does not inherently grant knowledge of whether or not it is "conscious".
Or like saying that the tangled web of neurons receiving signals from these understands anything about these subjects.
I'd say the opposite also applies: to the extent LLMs have an internal language, we understand very little of it.
We’re all just elementary particles being clumped together in energy gradients, therefore my little computer project is sentient—this is getting absurd.
Well you can say it doesn't understand, but then you don't have a very useful definition of the word.
You can say this is not 'real' understanding but you like many others will be unable to clearly distinguish this 'fake' understanding from 'real' understanding in a verifiable fashion, so you are just playing a game of meaningless semantics.
You really should think about what kind of difference is supposedly so important yet will not manifest itself in any testable way - an invented one.
Sorry, this is more about the discussion of this article than the article itself. The moving goal posts that acolytes use to declare consciousness are becoming increasingly cult-y.
Who cares about consciousness? This is just a mis-direction of the discussion. Ditto for 'intelligence' and 'understanding'.
Let's talk about what they can do and where that's trending.
We spent 40 years moving the goal posts on what constitutes AI. Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness", "understanding" and "intelligence".
> Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness", "understanding" and "intelligence".
We didn't "find" AI, we invented systems that some people want to call AI, and some people aren't convinced it meets the bar
It is entirely reasonable for people to realize we set the bar too low when it is a bar we invented
What should the bar be? Should it be higher than it is for the average human? Or even the least intelligent human?
Personally I don't care what the bar is, honestly
Call it AI, call it LLMs, whatever
Just as long as we continue to recognize that it is a tool that humans can use, and don't start trying to treat it as a human, or as a life, and I won't complain
I'm saving my anger for when idiots start to argue that LLMs are alive and deserve human rights
there is no such bar.
We don’t even have a good way to quantify human ability. The idea that we could suddenly develop a technique to quantify human ability because we now have a piece of technology that would benefit from that quantification is absurd.
That doesn’t mean we shouldn’t try to measure the ability of an LLM. But it does mean that the techniques used to quantify an LLMs ability are not something that can be applied to humans outside of narrow focus areas.
Indeed, science is a process of discovery and adjusting goals and expectations. It is not a mountain to be summited. It is highly telling that the LLM boosters do not understand this. Those with a genuine interest in pushing forward our understanding of cognition do.
They believe that once they reach this summit everything else will be trivial problems that can be posed to the almighty AI. It's not that they don't understand the process, it's that they think AI is going to disrupt that process.
They literally believe that the AI will supersede the scientific process. It's crypto shit all over again.
Well, if that summit were reached and AI is able to improve itself trivially, I'd be willing to cede that they've reached their goal.
Anything less than that, meh.
The original meaning of mechanical Turk is about a chess hoax and how it managed to make people think it was a thinking machine. https://en.wikipedia.org/wiki/Mechanical_Turk
The current LLM anthropomorphism may soon be known as the silicon Turk. Managing to make people think they're AI.
> Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness"
The goalposts already differentiated between "totally human-like" vs "actually conscious"
See also Philosophical Zombie thought experiment from the 70s.
> We spent 40 years moving the goal posts on what constitutes AI.
Who is "we"?
I think of "AI" as a pretty all-encompassing term. ChatGPT is AI, but so is the computer player in the 1995 game Command and Conquer, among thousands of other games. Heck, I might even call the ghosts in Pac-man "AI", even if their behavior is extremely simple, predictable, and even exploitable once you understand it.
Those have all been difficult words to define, with much debate over the past 40 years or longer.
My joke was that the "what it can't do" debate has changed into "what it shouldn't be allowed to do".
There ARE no jokes aloud on hn.
Look I’m no stranger to love. you know the rules and so do I… you can’t find this conversation with any other guy.
But since the parent was making a meta commentary on this conversation I’d like to introduce everyone here as Kettle to a friend of mine known as #000000