The article puts scare quotes around "understand" etc. to try to head off critiques around the lack of precision or scientific language, but I think this is a really good example of where casual use of these terms can get pretty misleading.
Because code LLMs have been trained on the syntactic form of the program and not its execution, it's not correct — even if the correlation between variable annotations and requested completions was perfect (which it's not) — to say that the model "understands nullability", because nullability means that under execution the variable in question can become null, which is not a state that it's possible for a model trained only on a million programs' syntax to "understand". You could get the same result if e.g. "Optional" means that the variable becomes poisonous and checking "> 0" is eating it, and "!= None" is an antidote. Human programmers can understand nullability because they've hopefully run programs and understand the semantics of making something null.
The paper could use precise, scientific language (e.g. "the presence of nullable annotation tokens correlates to activation of vectors corresponding to, and emission of, null-check tokens with high precision and accuracy") which would help us understand what we can rely on the LLM to do and what we can't. But it seems like there is some subconscious incentive to muddy how people see these models in the hopes that we start ascribing things to them that they aren't capable of.
I was going to say "so you believe the LLMs don't have the capacity to understand" but then I realized that the precise language would be something like "the presence of photons in this human's retinas in patterns encoding statements about LLMs having understanding correlates to the activation of neuron signaling chains corresponding to, and emission of, muscle activations engaging keyboard switches, which produce patterns of 'no they don't' with high frequency."
The critiques of mental state applied to LLMs are increasingly applicable to us biologicals, and that's the philosophical abyss we're staring down.
Countering the argument that LLMs are just glorified probability machines and do not understand or think with "how do you know humans are not the same" has been the biggest achievement of AI hypemen (and yes, it's mostly men).
Of course, now you can say "how do you know that our brains are not just efficient computers that run LLMs", but I feel like the onus of proof lies on the makers of this claim, not on the other side.
It is very likely that human intelligence is not just autocomplete on crack, given all we know about neuroscience so far.
No it's not. He gave you modal conditions on "understanding"; he said: predicting the syntax of valid programs and their operational semantics, i.e., the behaviour of the computer as it runs.
I would go much further than this, but this is a de minimis criterion that the LLM already fails.
What zealots eventually discover is that they can hold their "fanatical proposition" fixed in the face of all opposition to the contrary, by tearing down the whole edifice of science, knowledge, and reality itself.
If you wish to assert, against any reasonable thought, that the sky is a pink dome you can do so -- by claiming first that our eyes are broken, and then, eventually, that we live in some paranoid "philosophical abyss" carefully constructed to permit your paranoia.
This absurdity is exhausting, and I wish one day to find fanatics who'd realise it quickly and abate it -- but alas, I never have.
If you find yourself hollowing out the meaning of words to the point of making no distinctions, denying reality to reality itself, and otherwise arriving at a "philosophical abyss", be aware that it is your cherished propositions which are the madness and nothing else.
Here: no, the LLM does not understand. Yes, we do. It is your job to begin from reasonable premises and abduce reasonable theories. If you do not, you will not.
This only applies to people who understand how computers and computer programs work, because someone who doesn't externalize their thinking process would never ascribe human elements of consciousness to inanimate materials.
Certainly many ancient people worshiped celestial objects or crafted idols by their own hands and ascribed to them powers greater than themselves. That doesn't really help in the long run compared to taking personal responsibility for one's own actions and motives, the best interests of their tribe or community, and taking initiative to understand the underlying cause of mysterious phenomena.
We don't really have a clue what they are and aren't capable of. Prior to the LLM-boom, many people – and I include myself in this – thought it'd be impossible to get to the level of capability we have now purely from statistical methods and here we are. If you have a strong theory that proves some bounds on LLM-capability, then please put it forward. In the absence of that, your sceptical attitude is just as sus as the article's.
I majored in CogSci at UCSD in the 90's. I've been interested and active in the machine learning world for decades. The LLM boom took me completely and utterly by surprise, continues to do so, and frankly I am most mystified by the folks who downplay it. These giant matrixes are already so far beyond what we thought was (relatively) easily achievable that even if progress stopped tomorrow, we'd have years of work to put in to understand how we got here. Doesn't mean we've hit AGI, but what we already have is truly remarkable.
> Because code LLMs have been trained on the syntactic form of the program and not its execution
One of the very first tests I did of ChatGPT way back when it was new was to give it a relatively complex string manipulation function from our code base, strip all identifying information from the code (variable names, the function name itself, etc.), and then provide it with inputs and ask it for the outputs. I was surprised that it could correctly generate the output from the input.
So it does have some idea of what the code actually does, not just its syntax.
> Because code LLMs have been trained on the syntactic form of the program and not its execution
What makes you think this? It has been trained on plenty of logs and traces, discussion of the behavior of various code, REPL sessions, etc. Code LLMs are trained on all human language and wide swaths of whatever machine-generated text is available, they are not restricted to just code.
How do you know that these models haven't been trained by running programs?
At least, it's likely that they've been trained on undergrad textbooks that explain program behaviors and contain exercises.
For all you know, AI labs are doing E2E RL training with running code in the loop to advance the model's capability to act as an agent (for Cursor et al.).
LLMs "understand" nullability to the extent that texts they have been trained on contain examples of nullability being used in code, together with remarks about it in natural language. When the right tokens occur in your query, other tokens get filled in from that data in a clever way. That's all there is to it.
The LLM will not understand, and is incapable of developing an understanding, of a concept not present in its training data.
If you try to teach it the basics of the misunderstood concept in your chat, it will reflect back a verbal acknowledgement, restated in different words, with some smoothly worded embellishments which look like the external trappings of understanding. It's only a mirage, though.
The LLM will code anything, no matter how novel, if you give it detailed enough instructions and clarifications. That's just a language translation task from pseudo-code to code. Being a language model, it's designed for that.
The LLM is like a bar waiter who has picked up on economics and politics talk, and is able to interject with something clever-sounding, to the surprise of the patrons. Gee, how does he or she understand the workings of the International Monetary Fund, and what the hell are they doing working in this bar?
There seems to be a typo in OP's "Visualizing Our Results" - but things make perfect sense if red is non-nullable, green is nullable.
I'd be really curious to see where the "attention" heads of the LLM look when evaluating the nullability of any given token. Does it just trust the Optional[int] return type signature of the function, or does it also skim through the function contents to understand whether that's correct?
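To make the setup concrete, here is a hedged Python sketch of the kind of code in question (the function and names are made up, not taken from the article): the return annotation alone advertises nullability, while the body is what actually makes it true, so a probe could in principle key on either signal.

    # Hypothetical example: the annotation says the result may be None,
    # and the body confirms it.
    from typing import Optional

    def find_index(items: list[int], target: int) -> Optional[int]:
        for i, item in enumerate(items):
            if item == target:
                return i
        return None  # the nullable case the annotation promises

    idx = find_index([1, 2, 3], 5)
    if idx is not None:  # the None check a model tracking nullability should emit
        print(idx + 1)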
It's fascinating to me to think that the senior developer skillset of being able to skim through complicated code, mentally make note of different tokens of interest where assumptions may need to be double-checked, and unravel that cascade of assumptions to track down a bug, is something that LLMs already excel at.
Sure, nullability is an example where static type checkers do well, and it makes the article a bit silly on its own... but there are all sorts of assumptions that aren't captured well by type systems. There's been a ton of focus on LLMs for code generation; I think that LLMs for debugging makes for a fascinating frontier.
One thing that is exciting in the text is the attempt to move away from asking whether an LLM 'understands' (which I would argue is an ill-posed question) and instead rephrase it in terms of something that can actually be measured.
It would be good to list a few possible ways of interpreting 'understanding of code'. It could possibly include:
1) Type inference for the result
2) nullability
3) runtime asymptotics
4) What the code does
5) predicting a bunch of language tokens from the compressed database of knowledge encoded as weights, calculated out of numerous examples that exploit nullability in code and talk about it in accompanying text.
Is there any way you can tell whether a human understands something other than by asking them a question and judging their answer?
Nobody interrogates each other's internal states when judging whether someone understands a topic. All we can judge it based on are the words they produce or the actions they take in response to a situation.
The way that systems or people arrive at a response is sort of an implementation detail that isn't that important when judging whether a system does or doesn't understand something. Some people understand a topic on an intuitive, almost unthinking level, and other people need to carefully reason about it, but they both demonstrate understanding by how they respond to questions about it in the exact same way.
No, most people absolutely use non-linguistic, involuntary cues when judging the responses of other people.
To not do that is commonly associated with things like being on the spectrum or cognitive deficiencies.
On a message board? Do you have theories about whether people on this thread understand or don't understand what they're talking about?
You are saying no while presenting nothing to contradict what GP said.
Judging someone's external "involuntary cues" is not interrogating their internal state. It is, as you said, judging their response (a synonym for "answer") - and that judgment is also highly imperfect.
(It's worth noting that focusing so much on someone's body language and tone that you ignore the actual words they said is a communication issue associated with not being on the spectrum, or being too allistic.)
I found this overly handwavy, but I discovered that there is a non-"gentle" version of this page which is more explicit:
https://dmodel.ai/nullability/
The visualisation of how the model sees nullability was fascinating.
I'm curious if this probing of nullability could be composed with other LLM/ML-based python-typing tools to improve their accuracy.
Maybe even focusing on interfaces such as nullability rather than precise types would work better with a duck-typed language like Python than inferring types directly (i.e. we don't really care that a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric).
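For what it's worth, a rough sketch of what "targeting the interface" could look like in Python, using typing.Protocol; the names here are illustrative, not something proposed by the post:

    # Illustrative sketch: expressing "supports __add__/__sub__" as an interface
    # (a Protocol) instead of requiring a concrete type like int.
    from typing import Protocol

    class SupportsArithmetic(Protocol):
        def __add__(self, other): ...
        def __sub__(self, other): ...

    def total_change(readings: list[SupportsArithmetic]) -> SupportsArithmetic:
        # accepts ints, floats, Decimals, numpy scalars... anything "numeric enough"
        return readings[-1] - readings[0]

    print(total_change([1.5, 2.0, 4.5]))  # 3.0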
Why not just use a language with checked nullability? What's the point of an LLM using a duck typing language anyway?
This post actually mostly uses the subset of Python where nullability is checked. The point is not to introduce new LLM capabilities, but to understand more about how existing LLMs are reasoning about code.
> we don't really care that a variable is an int specifically, but rather that it supports __add__ or __sub__ etc., that it is numeric
my brother in christ, you invented Typescript.
(I agree on the visualization, it's very cool!)
I am more than aware of TypeScript; you seem to have misunderstood my point. I was not describing a particular type system (of which there have been many of this ilk) but rather conjecturing that targeting interfaces specifically might make LLM-based code generation/type inference more effective.
Yeah, I read that comment wrong. I didn't mean to come off like that. Sorry.
Once LLMs fully understand nullability, they will cease to use that.
Tony Hoare called it "a billion-dollar mistake" (https://en.wikipedia.org/wiki/Tony_Hoare#Apologies_and_retra...); Rust made core design choices precisely to avoid this mistake.
In practical AI-assisted coding in TypeScript I have found that it is good to add in Cursor Rules to avoid anything nullable, unless it is a well-designed choice. In my experience, it makes code much better.
I don't get the problem with null values as long as you can statically reason about them, which wasn't even the case in Java, where you always had to do runtime null-guards before access.
But in Typescript, who cares? You’d be forced to handle null the same way you’d be forced to handle Maybe<T> = None | Just<T> except with extra, unidiomatic ceremony in the latter case.
What do you mean by unidiomatic? If a language has that as a core concept, then it's idiomatic by definition.
Sounds like the process used to update/jailbreak LLMs so that they don't deny requests and always answer. There is also a direction for refusal. (Article about it: https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in...)
Would be fun if they also „cancelled the nullability direction“.. the llms probably would start hallucinating new explanations for what is happening in the code.
> Interestingly, for models up to 1 billion parameters, the loss actually starts to increase again after reaching a minimum. This might be because as training continues, the model develops more complex, non-linear representations that our simple linear probe can’t capture as well. Or it might be that the model starts to overfit on the training data and loses its more general concept of nullability.
Double descent?
I'm curious what happens if you run the LLM with variable names that occur often with nullable variables, but then use them with code that has a non-nullable variable.
Dear future authors: please run multiple iterations and report the probability.
From: ‘Keep training it, though, and eventually it will learn to insert the None test’
To: ‘Keep training it, though, and eventually the probability of inserting the None test goes up to xx%’
The former is just horse poop; we all know LLMs generate big variance in output.
If you're interested in a more scientific treatment of the topic, the post links to a technical report which reports the numbers in detail. This post is instead an attempt to explain the topics to a more general audience, so digging into the weeds isn't very useful.
LLMs understand nothing.
They are not reasoning.
"Validate a phone number."
The code is entirely wrong. It validates something that's close to a NANP number but isn't actually a NANP number. In particular, the area code cannot start with 0, nor can the central office code. There are several numbers, like 911, which have special meaning and cannot appear in either position.
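For illustration only, a hedged sketch of those stricter rules (my own approximation of a NANP subset, not a complete or authoritative validator):

    import re

    # Area code and central office code must start with 2-9, and N11 service
    # codes (211, 311, ..., 911) are excluded from both positions.
    NANP_RE = re.compile(r"^\(?([2-9][0-9]{2})\)?[-. ]?([2-9][0-9]{2})[-. ]?([0-9]{4})$")

    def looks_like_nanp(number: str) -> bool:
        m = NANP_RE.match(number.strip())
        if not m:
            return False
        area, office = m.group(1), m.group(2)
        return not (area.endswith("11") or office.endswith("11"))

    assert looks_like_nanp("(415) 555-0123")
    assert not looks_like_nanp("911-555-0123")  # N11 code in the area position
    assert not looks_like_nanp("415-011-0123")  # office code can't start with 0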
You'd get better results if you went to Stack Overflow and stole the correct answer yourself. Would probably be faster too.
This is why "non technical code writing" is a terrible idea. The underlying concept is explicitly technical. What are we even doing?
This is really interesting! Intuitively it's hard to grasp that you can just subtract two average states and get a direction describing the model's perception of nullability.
The original word2vec example might be easier to understand: vector("king") - vector("man") + vector("woman") lands close to vector("queen").
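A minimal numpy sketch of the mean-difference idea described above (shapes, data, and names are assumed for illustration; this is not the article's code):

    import numpy as np

    rng = np.random.default_rng(0)
    # Pretend hidden states collected at variable tokens, 768-dim each.
    nullable_states = rng.normal(loc=0.3, size=(100, 768))       # variables that may be None
    non_nullable_states = rng.normal(loc=-0.3, size=(100, 768))  # variables that never are

    # "Nullability direction": difference of the two class means, normalized.
    direction = nullable_states.mean(axis=0) - non_nullable_states.mean(axis=0)
    direction /= np.linalg.norm(direction)

    def nullability_score(hidden_state: np.ndarray) -> float:
        # Larger projection onto the direction means "represented as more nullable".
        return float(hidden_state @ direction)

    print(nullability_score(nullable_states[0]))      # tends to be positive
    print(nullability_score(non_nullable_states[0]))  # tends to be negative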
As every fifth thread becomes some discussion of LLM capabilities, I think we need to shift the way we talk about this to be less like how we talk about software and more like how we talk about people.
"LLM" is a valid category of thing in the world, but it's not a thing like Microsoft Outlook that has well-defined capabilities and limitations. It's frustrating reading these discussions that constantly devolve into one person saying they tried something that either worked or didn't, then 40 replies from other people saying they got the opposite result, possibly with a different model, different version, slight prompt altering, whatever it is.
LLMs possibly have the capability to understand nullability, but that doesn't mean every instance of every model will consistently understand that or anything else. This is the same way humans operate. Humans can run a 4-minute mile. Humans can run a 10-second 100 meter dash. Humans can develop and prove novel math theorems. But not all humans, not all the time, performance depends upon conditions, timing, luck, and there has probably never been a single human who can do all three. It takes practice in one specific discipline to get really good at that, and this practice competes with or even limits other abilities. For LLMs, this manifests in differences with the way they get fine-tuned and respond to specific prompt sequences that should all be different ways of expressing the same command or query but nonetheless produce different results. This is very different from the way we are used to machines and software behaving.
Yeah the link title is overclaiming a bit, the actual post title doesn't make such a general claim, and the post itself examines several specific models and compares their understanding.
Encouraging the continued anthropomorphization of these models is a bad idea, especially in the context of discussing their capabilities.
very cool.
This is like claiming a photoresistor-controlled night light "understands when it is dark" or that a bimetallic strip thermostat "understands temperature". You can say those words, and it's syntactically correct but entirely incorrect semantically.
The post includes this caveat. Depending on your philosophical position about sentience you might say that LLMs can't possibly "understand" anything, and the post isn't trying to have that argument. But to the extent that an LLM can "understand" anything, you can study its understanding of nullability.
People don’t use “understand” for machines in science because people may or may not believe in the sentience of machines. That would be a weird catering to panpsychism.
Where is the boundary where this becomes semantically correct? It's easy for these kinds of discussions to go in circles, because nothing is well defined.
Hard to define something that science has yet to formally outline, and is largely still in the realm of religion.
That depends entirely on whether you believe understanding requires consciousness.
I believe that the type of understanding demonstrated here doesn't. Consciousness only comes into play when we become aware that such understanding has taken place, not in the process itself.
Shameless plug of a personal blog post, but relevant. Still not fully edited, so the writing is a bit scattered, but the crux is that we now have the framework for talking about consciousness intelligently. It's not as mysterious as in the past, considering advances in non-equilibrium thermodynamics and the Free Energy Principle in particular.
https://stefanlavelle.substack.com/p/i-am-therefore-i-feel
You declare this very plainly without evidence or argument, but this is an age-old controversial issue. It’s not self-evident to everyone, including philosophers.
It's not age-old nor is it controversial. LLMs aren't intelligent by any stretch of the imagination. Each word/token is chosen as that which is statistically most likely to follow the previous. There is no capability for understanding in the design of an LLM. It's not a matter of opinion; this just isn't how an LLM works.
Any comparison to the human brain is missing the point that an LLM only simulates one small part, and that's notably not the frontal lobe. That's required for intelligence, reasoning, self-awareness, etc.
So, no, it's not a question of philosophy. For an AI to enter that realm, it would need to be more than just an LLM with some bells and whistles; an LLM plus something else, perhaps, something fundamentally different which does not yet currently exist.
> Each word/token is chosen as that which is statistically most likely to follow the previous.
The best way to predict the weather is to have a model which approximates the weather. The best way to predict the results of a physics simulation is to have a model which approximates the physical bodies in question. The best way to predict what word a human is going to write next is to have a model that approximates human thought.
LLMs don't approximate human thought, though. They approximate language. That's it.
Please, I'm begging you, go read some papers and watch some videos about machine learning and how LLMs actually work. It is not "thinking."
I fully realize neural networks can approximate human thought -- but we are not there yet, and when we do get there, it will be something that is not an LLM, because an LLM is not capable of that -- it's not designed to be.
> LLMs don't approximate human thought, though. ...Please, I'm begging you, go read some papers and watch some videos about machine learning and how LLMs actually work.
I know how LLMs work; so let me beg you in return, listen to me for a second.
You have a theoretical-only argument: LLMs do text prediction, and therefore it is not possible for them to actually think. And since it's not possible for them to actually think, you don't need to consider any other evidence.
I'm telling you, there's a flaw in your argument: In actuality, the best way to do text prediction is to think. An LLM that could actually think would be able to do text prediction better than an LLM that can't actually think; and the better an LLM is able to approximate human thought, the better its predictions will be. The fact that they're predicting text in no way proves that there's no thinking going on.
Now, that doesn't prove that LLMs actually are thinking; but it does mean that they might be thinking. And so you should think about how you would know if they're actually thinking or not.
> it will be something that is not an LLM
I think it will be very similar in architecture.
Artificial neural networks already are approximating how neurons in a brain work, it's just at a scale that's several orders of magnitude smaller.
Our limiting factor for reaching brain-like intelligence via ANN is probably more of a hardware limitation. We would need over 100 TB to store the weights for the neurons, not to mention the ridiculous amount of compute to run it.
Isn't language expressed thought?
Language can be a (lossy) serialization of thought, yes. But language is not thought, nor inherently produced by thought. Most people agree that a process randomly producing grammatically correct sentences is not thinking.
Many people don't think we have any good evidence that our brains aren't essentially the same thing: a stochastic statistical model that produces outputs based on inputs.
Of course, you're right. Neural networks mimic exactly that after all. I'm certain we'll see an ML model developed someday that fully mimics the human brain. But my point is an LLM isn't that; it's a language model only. I know it can seem intelligent sometimes, but it's important to understand what it's actually doing and not ascribe feelings to it that don't exist in reality.
Too many people these days are forgetting this key point and putting a dangerous amount of faith in ChatGPT etc. as a result. I've seen DOCTORS using ChatGPT for diagnosis. Ignorance is scary.
If you're willing to torture the analogy you can find a way to describe literally anything as a system of outputs based on inputs. In the case of the brain-to-LLM comparison, people are inclined to do it because they're eager to anthropomorphize something that produces text they can interpret as a speaker, but it's totally incorrect to suggest that our brains are "essentially the same thing" as LLMs. The comparison is specious even on a surface level. It's like saying that birds and planes are "essentially the same thing" because flight was achieved by modeling planes after birds.
That's probably the case 99% of the time.
But that 1% is pretty important.
For example, they are dismal at math problems that aren't just slight variations of problems they've seen before.
Here's one by blackpenredpen where ChatGPT insisted its solution to a problem that could be solved by high school / talented middle school students was correct, even after attempts to convince it it was wrong. https://youtu.be/V0jhP7giYVY?si=sDE2a4w7WpNwp6zU&t=837
Rewind earlier to see the real answer
ChatGPT o1 pro mode solved it on the first try, after 8 minutes and 53 seconds of “thinking”:
https://chatgpt.com/share/67f40cd2-d088-8008-acd5-fe9a9784f3...
The problem is: how do you know that it's correct?
A human would probably say "I don't know how to solve the problem", but the free version of ChatGPT is confidently wrong.
> For example, they are dismal at math problems that aren't just slight variations of problems they've seen before.
I know plenty of teachers who would describe their students the exact same way. The difference is mostly one of magnitude (of delta in competence), not quality.
Also, I think it's important to note that by "could be solved by high school / talented middle school students" you mean "specifically designed to challenge the top ~1% of them". Because if you say "LLMs only manage to beat 99% of middle schoolers at math", the claim seems a whole lot different.
Care to share any of this good evidence?
Do biologists and neuroscientists not have any good evidence or is that just computer scientists and engineers speaking outside of their field of expertise? There's always been this danger of taking computer and brain comparisons too literally.
That argument only really applies to base models. After that we train them to give correct and helpful answers, not just answers that are statistically probable in the training data.
But even if we ignore that subtlety, it's not obvious that training a model to predict the next token doesn't lead to a world model and an ability to apply it. If you gave a human 10 physics books and told them that in a month they have a test where they have to complete sentences from the book, which strategy do you think is more successful: trying to memorize the books word by word or trying to understand the content?
The argument that understanding is just an advanced form of compression far predates LLMs. LLMs clearly lack many of the faculties humans have. Their only concept of a physical world comes from text descriptions and stories. They have a very weird form of memory, no real agency (they only act when triggered), and our attempts at replicating an internal monologue are very crude. But understanding is one thing they may well have, and if the current generation of models doesn't have it, the next generation might.
The thermostat analogy, and equivalents, are age-old.
Philosophers are often the last people to consider something to be settled. There's very little in the universe that they can all agree is true.
Or like saying the photoreceptors in your retina understand when it's dark. Or like claiming the temperature sensitive ion channels in your peripheral nervous system understand how hot it is.
This is a fallacy I've seen enough on here that I think it needs a name. Maybe the fallacy of Theoretical Reducibility (doesn't really roll off the tongue)?
When challenged, everybody becomes an eliminative materialist even if it's inconsistent with their other views. It's very weird.
Describing the mechanics of nervous impulses != describing consciousness.
Which is the point, since describing the mechanics of LLM architectures does not inherently grant knowledge of whether or not it is "conscious".
Or like saying that the tangled web of neurons receiving signals from these understands anything about these subjects.
I'd say the opposite also applies: to the extent LLMs have an internal language, we understand very little of it.
We’re all just elementary particles being clumped together in energy gradients, therefore my little computer project is sentient—this is getting absurd.
Well you can say it doesn't understand, but then you don't have a very useful definition of the word.
You can say this is not 'real' understanding but you like many others will be unable to clearly distinguish this 'fake' understanding from 'real' understanding in a verifiable fashion, so you are just playing a game of meaningless semantics.
You really should think about what kind of difference is supposedly so important yet will not manifest itself in any testable way - an invented one.
Sorry, this is more about the discussion of this article than the article itself. The moving goal posts that acolytes use to declare consciousness are becoming increasingly cult-y.
Who cares about consciousness? This is just a mis-direction of the discussion. Ditto for 'intelligence' and 'understanding'.
Let's talk about what they can do and where that's trending.
We spent 40 years moving the goal posts on what constitutes AI. Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness", "understanding" and "intelligence".
> Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness", "understanding" and "intelligence".
We didn't "find" AI, we invented systems that some people want to call AI, and some people aren't convinced it meets the bar
It is entirely reasonable for people to realize we set the bar too low when it is a bar we invented
What should the bar be? Should it be higher than it is for the average human? Or even the least intelligent human?
Personally I don't care what the bar is, honestly
Call it AI, call it LLMs, whatever
Just as long as we continue to recognize that it is a tool that humans can use, and don't start trying to treat it as a human, or as a life, and I won't complain
I'm saving my anger for when idiots start to argue that LLMs are alive and deserve human rights
there is no such bar.
We don’t even have a good way to quantify human ability. The idea that we could suddenly develop a technique to quantify human ability because we now have a piece of technology that would benefit from that quantification is absurd.
That doesn’t mean we shouldn’t try to measure the ability of an LLM. But it does mean that the techniques used to quantify an LLMs ability are not something that can be applied to humans outside of narrow focus areas.
Indeed, science is a process of discovery and adjusting goals and expectations. It is not a mountain to be summited. It is highly telling that the LLM boosters do not understand this. Those with a genuine interest in pushing forward our understanding of cognition do.
They believe that once they reach this summit everything else will be trivial problems that can be posed to the almighty AI. It's not that they don't understand the process, it's that they think AI is going to disrupt that process.
They literally believe that the AI will supersede the scientific process. It's crypto shit all over again.
Well, if that summit were reached and AI is able to improve itself trivially, I'd be willing to cede that they've reached their goal.
Anything less than that, meh.
The original meaning of mechanical Turk is about a chess hoax and how it managed to make people think it was a thinking machine. https://en.wikipedia.org/wiki/Mechanical_Turk
The current LLM anthropomorphism may soon be known as the silicon Turk. Managing to make people think they're AI.
> Now we seem to have found an AI worthy of that title and instead start moving the goal posts on "consciousness"
The goalposts already differentiated between "totally human-like" vs "actually conscious"
See also Philosophical Zombie thought experiment from the 70s.
> We spent 40 years moving the goal posts on what constitutes AI.
Who is "we"?
I think of "AI" as a pretty all-encompassing term. ChatGPT is AI, but so is the computer player in the 1995 game Command and Conquer, among thousands of other games. Heck, I might even call the ghosts in Pac-man "AI", even if their behavior is extremely simple, predictable, and even exploitable once you understand it.
Those have all been difficult words to define, with much debate over the past 40 years or longer.
My joke was that the "what it can't do" debate has changed into "what it shouldn't be allowed to do".
There ARE no jokes aloud on hn.
Look I’m no stranger to love. you know the rules and so do I… you can’t find this conversation with any other guy.
But since the parent was making a meta commentary on this conversation I’d like to introduce everyone here as Kettle to a friend of mine known as #000000