27 comments

  • thrance 3 hours ago

    By "Mathematical Reasoning Capabilities" they meant "basic arithmetic" and by "Numerical Precision" they meant "quantization".

    • Vetch 34 minutes ago

      That's not quite right. By numerical precision they mean numerical precision, of which quantization is one method of arriving at reduced precision. They also perform experiments where they train from scratch in float32 and float16.

      They should have replaced "reasoning" with: iterative computations with accumulating state. This paper on the impact of quantization is actually a lot more significant than it appears at first, and I think the authors could have done a better job of discussing the broader implications.

      The paper's core (and unsurprising) argument is that low-precision arithmetic significantly limits the representational capacity of individual neurons. This forces the model to encode numerical values across multiple neurons to avoid overflow, particularly when storing intermediate computational results. This distributed representation, in turn, increases complexity and makes the model vulnerable to accumulating errors, particularly during iterative operations. It's unclear from my initial reading whether low-precision training (quantization-aware training) is necessary for the model to effectively learn and utilize these distributed representations, or if this capacity is inherent. Regardless, while QAT likely offers benefits, especially with larger numbers, the fundamental limitations of low precision computation persist.
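
      To make the accumulating-error point concrete, here is a minimal numpy sketch (my own illustration, not from the paper): repeatedly adding a small increment in float16 stalls once the running total is large enough that the increment falls below half the spacing between representable values, while float32 keeps tracking it.

        import numpy as np

        def accumulate(dtype, n=10_000, step=0.1):
            # Repeatedly add `step` to a running total kept in `dtype`.
            total = dtype(0)
            for _ in range(n):
                total = dtype(total + dtype(step))
            return total

        print(accumulate(np.float32))  # ~1000, with small rounding error
        print(accumulate(np.float16))  # 256.0: the sum stalls once adding 0.1
                                       # can no longer change the float16 total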

      Why not just use a calculator? For some of the same reasons humans shouldn't be completely dependent on calculators. It's not just the ability to perform Fermi estimates that's constrained, but also internal computations that require physical "intuition" or modeling the trajectories of physical systems, the ability to work with growing algebraic representations, relative numeric comparisons of large magnitudes (where the model does not internally switch to a favorable format--notice this is easier to do in-context), and representing iterative computation on complex logical chains and hierarchical structures.

      Why do we not see this in practice? I contend that we do. There is a small but quite vocal contingent in every LLM forum who insist that quantization, even to 8 bits, results in severe degradation in quality despite what most benchmarks say. It's quite likely that common tasks and most tests do not require iterative computations where accumulating state representations must be accurately tracked, and these individuals are encountering some of the exceptions.

  • alexvitkov 3 hours ago

    I wonder if we'll get better performance on arithmetic tasks if we let LLMs generate digits backwards, e.g. 12345+11111=<rev>65432</rev> where the rev tags are special tokens.

    The reason being that there's less context needed to do it this way, for addition at every step there's only 3 digits that need to be considered and they're already in the token stream.
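
    A minimal sketch of that least-significant-digit-first scheme in plain Python (digits handled as strings; the special <rev> tokens themselves aren't modeled here):

      def add_reversed(a_rev: str, b_rev: str) -> str:
          # Operands arrive with digits reversed, least significant first,
          # e.g. 12345 -> "54321". At each position only the two current
          # digits and the carry matter.
          out, carry = [], 0
          for i in range(max(len(a_rev), len(b_rev))):
              da = int(a_rev[i]) if i < len(a_rev) else 0
              db = int(b_rev[i]) if i < len(b_rev) else 0
              carry, digit = divmod(da + db + carry, 10)
              out.append(str(digit))
          if carry:
              out.append(str(carry))
          return "".join(out)  # still reversed, like the <rev>...</rev> output

      print(add_reversed("54321", "11111"))  # "65432", i.e. 23456 reversed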

    • Vetch 28 minutes ago

      It does, and the paper mentions some papers in its appendix that investigate similar strategies, but the fundamental precision issues do not go away.

    • thesz 2 hours ago

      Most probably, something like 65432 is just one token.

      • alexvitkov an hour ago

        It's not; it wouldn't make sense to have 100,000 tokens just for the first 100,000 numbers. There's a playground [1] where you can see how LLMs tokenize a string.

        12345678987654321 is tokenized on various models like so:

          GPT4                123-456-789-876-543-21
          GPT3                123-45-678-98-765-43-21
          Llama-2, Mistral    1-2-3-4-5-6-7-8-9-8-7-6-5-4-3-2-1
        
        [1] https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
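
        You can also check this locally with a tokenizer library; a small sketch using tiktoken (assuming the cl100k_base encoding, which GPT-4 uses):

          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")
          tokens = enc.encode("12345678987654321")
          # Decode each token id individually to see the digit chunks
          print([enc.decode([t]) for t in tokens])
          # ['123', '456', '789', '876', '543', '21']
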
  • magicalhippo 2 hours ago

    Just been dabbling with local models, and while the several models I've tried generate decent sentences while quantized, they suffered heavily in following instructions and picking up details.

    So a larger but fairly aggressively quantized model could perform worse than a smaller variant of the model with just light quantization, even though the larger one still used more memory in total.

    I guess some of this is due to the models not being trained for the quantization levels I used. In any case, I say don't get blinded by parameter count alone; compare performance.
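
    For a rough sense of the memory comparison, a back-of-envelope sketch (the model sizes and bit widths are hypothetical, and KV cache, activations and quantization metadata are ignored):

      def weight_gib(params_billion: float, bits_per_weight: float) -> float:
          # Approximate weight memory: parameters * bits per weight / 8 bytes.
          return params_billion * 1e9 * bits_per_weight / 8 / 2**30

      print(f"13B @ 4-bit: {weight_gib(13, 4):.1f} GiB")  # ~6.1 GiB
      print(f" 7B @ 6-bit: {weight_gib(7, 6):.1f} GiB")   # ~4.9 GiB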

  • Animats an hour ago

    The empirical results seem legit. Not sure about all those "theorems". Those are claims of function existence, not a demonstration that transformers actually work that way. Comments?

  • hggigg 3 hours ago

    Isn't reasoning a little strong word? I mean finding evidence of any reasoning is the first point and I haven't seen anything that points to that.

    (note that I mean structured reasoning to solve new problems here, not just the correct answer and steps for existing problems).

    • swatcoder 3 hours ago

      The field of AI likes to borrow pop psychology/philosophy language and use it for its own narrow jargon. It's been doing that for almost a century now. That's how it builds enthusiasm and raises funds, and also how it shoots itself in the foot and invites the AI winters that hollow it out again.

      The field named itself that very way in response to an early 20th century boom of interest (money) in "intelligence" as something quantifiable (and therefore potentially computable). Tons of money was flying around based on the idea that people could be quantitatively measured for work-suitability, and researchers in computation saw an opportunity to get research funds to explore that in their own budding endeavor, so they adopted the language and made a very long bet that they might eventually figure it out.

      At this point, there is largely consensus within the field that current models perform a process that the field has collectively decided to call "reasoning" as part of its tradition of repurposing psychology and philosophy language as jargon.

      There's no putting that cat back in the bag. It's a term of art within the field now. Frustratingly, it's easier to just accept that this is just a novel sense of the word than to try to reconcile it with senses that we might be familiar with from lay usage or from other fields.

      • d4mi3n 2 hours ago

        I find it apropos that this topic (disagreement about whether LLMs can "reason") boils down to one of the hard problems in computer science: naming things.

    • krisoft 2 hours ago

      > finding evidence of any reasoning is the first point

      Happy to oblige. Oxford Languages dictionary defines “reasoning” as “the action of thinking about something in a logical, sensible way.”

      This was my prompt to chatgpt: “Hey chatgpt! On hacker news user hggigg said this: Isn't reasoning a little strong word? I mean finding evidence of any reasoning is the first point and I haven't seen anything that points to that.

      Based on this comment what do you think is the likely attitude of hggigg about large language models? (Answer with a single sentence)”

      This was the response: “hggigg likely expresses skepticism about large language models' ability to genuinely reason.”

      That sounds like reasoning to me. It was an action of thinking about something in a logical, sensible way. The something (a snippet of your comment) was novel. I don’t know what you mean by “structured”. Does it provide evidence of reasoning for you? (If not, what would?)
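
      For anyone who wants to rerun this, a minimal sketch with the OpenAI Python client (I used the regular ChatGPT interface, so the model name below is only illustrative):

        from openai import OpenAI

        client = OpenAI()  # expects OPENAI_API_KEY in the environment
        prompt = (
            "Hey chatgpt! On hacker news user hggigg said this: Isn't reasoning a "
            "little strong word? I mean finding evidence of any reasoning is the "
            "first point and I haven't seen anything that points to that.\n\n"
            "Based on this comment what do you think is the likely attitude of "
            "hggigg about large language models? (Answer with a single sentence)"
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative; any recent chat model
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)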

      • nyrikki an hour ago

        Modern English lexicography is descriptive and not prescriptive. The OED is simply insufficient to show much more than a particular meaning was common among some cohort.

        ELIZA transcripts also used to pass as human, even with experts in some contexts, but it was clearly mimicry.

        It is almost impossible to have real discussions on this topic without deciding on definitions beforehand.

        That is especially true given the impressive results of modern LLMs, but they are still essentially pattern finding and matching.

        If that meets your personal definition of 'reasoning', that is fine.

        It is probably advisable to understand the limitations if you are going to monetize it.

        Even stochastic parrots may have practical applications, but mimicry doesn't prove understanding.

        One challenge is that modern LLMs are so huge that it is fairly difficult to find anything outside the corpus.

        Look at OpenAI's GPT-4 technical report and you will see that almost all of its performance on school tests is from pre-training.

        That doesn't diminish the potential value, but it does point to the limits. If you have access to things that are reliably not in the corpus, or that are likely to be lost with set shattering, it is fairly easy to show that it is still NLP without a real understanding of the tokens.

        But that is my personal definition of 'reasoning', which obviously isn't the same as yours.

      • swatcoder 2 hours ago

        > It was an action of thinking about something in a logical, sensible way

        If only it was so simple! Empirically, you streamed a short text into a black box system and received a short text out. The output text was related and coherent, and -- after the fact -- you assessed it to be logically, sensibly applicable given the input.

        On your way to using that to exemplify "reasoning" you took for granted that there was "thinking", and that the "thinking" was both "logical" and "sensible".

        Any number of systems might deliver the empirical result you measured, and many of them would involve none of that. LLMs are sophisticated and highly capable, but the discussion about what language to use for what they do isn't nearly so simple as you suggest.

        • og_kalu an hour ago

          >Any number of systems might deliver the empirical result you measured

          Like what? What exactly would give similar empirical results over the wide range of tests LLMs have been subjected to?

          >and many of them would involve none of that.

          Oh? How would you know that?

          Did I miss the breakthrough Intellisense-o-meter that can measure the intelligence flowing through a system?

          I guess I'm just not sure what point you're trying to make here. Empirically assessing the result after the fact is how we determine intelligence and reasoning in anything, humans included.

          How do you determine a piece of metal is real gold? By comparing the results of a series of tests against the properties of gold you have outlined. The metal is a black box. You don't need to understand every interaction occurring in your tests to determine whether it is gold or not.

          • swatcoder 27 minutes ago

            > >Any number of systems might deliver the empirical result you measured

            > Like what? What exactly would give similar empirical results over the wide range of tests...

            The GP gave a single trivial example and my reply was highlighting how that example didn't inform the question at hand in any way. That example might just as well be exhibited by a lookup table, a lucky random output, a modified Markov chain, a Chinese Room, a trivial small language model trained on the right corpus, etc.

            That certain LLMs or chatbots might also be able to deliver on more sophisticated examples that those systems cannot is not what they or I were talking about here. It was a discussion about obviousness from trivial samples, and about the deeper semantic dependencies hidden in the GP's definition. Such trivial samples don't hold up to scrutiny as their definition recurses down through concepts that are not at all demonstrated in these trivial cases. Their attempt at rhetoric just fails.

            > Empirically assessing the result after the fact is how we determine intelligence and reasoning in.. anything humans included.

            No. We project intelligence onto humans inductively by identifying our own experience as intelligence and assuming that other things that are almost identical to us are very likely to be working in the same way.

            In recent decades, there's been growing acceptance that other systems approximately like us (animals of certain kinds) exhibit it to various degrees as well, but even this was a formally rejected concept during the early 20th century, when "intelligence" first became treated as something measurable and quantifiable at all. The concept of intelligence is a wholly cultural one, and its definition and applicability move around over time.

            Yet there are currently very few people who would apply the word to a lookup table, a random output, a Chinese Room, a modified Markov chain, etc., and a small but growing number of people who are comfortable applying it to LLMs and chatbots as we see them today. As this happens and its use expands, the sense of the word changes, as it has been doing for hundreds of years.

            At this point, we mostly can't yet rely on chatbots to fulfill roles that require interesting human intelligence. They can pass some benchmark tests to greater or lesser degrees, but they also happen to be excellent benchmark optimizers (that's the essence of their design), so it remains hard to know what that really means.

            If and when they become reliable substitutes for human intelligence in general tasks, or become interestingly independent in the way we see certain animals, the prior-yet-modern senses of intelligence will be easier to apply. But we're not there yet.

        • d4mi3n 2 hours ago

          I agree with this sentiment and would remind everyone that LLMs are probabilistic models. Anything that isn't in the training data set will not be produced as an output. If you squint really hard, you could kinda say an LLM is just a fancy compression/decompression scheme.

          That said, in addition to anthropomorphizing something that sounds like a person, I think a fascinating thing about these discussions is the variety of unstated opinions on what reasoning, thought, or awareness is. This whole topic would be a lot simpler to classify if we had a better understanding of how _we_ are able to reason, think, and be aware to the level of understanding we have of the math behind an LLM.

          • mannykannot an hour ago

            > Anything that isn't in the training data set will not be produced as an output.

            Thinking back to some of the examples I have seen, it feels as though this is not so. In particular, I'm thinking of a series where ChatGPT was prompted to invent a toy language with a simple grammar and then translate sentences between that language and English [1]. It seems implausible that the outputs produced here were in the training data set.

            Furthermore, given that LLMs are probabilistic models and produce their output stochastically, it does not seem surprising that they might produce output not in their training data.

            I agree that we do not have a good understanding of what reasoning, thought, or awareness is. One of the questions I have been pondering lately is whether, when a possible solution or answer to some question pops into my mind, it is the result of an unconscious LLM-like process, but that can't be all of it; for one thing, I - unlike LLMs - have a limited ability to assess my ideas for plausibility without being prompted to do so.

            [1] https://maximumeffort.substack.com/p/i-taught-chatgpt-to-inv...

          • og_kalu an hour ago

            >that LLMs are probabilistic models.

            And the brain isn't? How do you think you have such a seamlessly continuous view of reality? The brain is very probabilistic.

            >Anything that isn't in the training data set will not be produced as an output.

            This is not a restriction of probabilistic models. And it's certainly not a restriction of SOTA LLMs you can test today.

            >be aware to the level of understanding we have of the math behind an LLM.

            We have very little understanding of the meaning of the computations in large ANNs.

        • wbl 2 hours ago

          And how do I know you think?

          • swatcoder 2 hours ago

            Are you sure you do? You're just responding to a block of text on an internet forum with a trivial signup flow, known to be populated by bots of greater or lesser sophistication, as everywhere else on the internet.

          • d4mi3n 2 hours ago

            We don't! We've gotten into philosophy, which is always a blast to ponder over. Aside from measuring brain activity and making observations, all we really know is that awareness seems to be an emergent property of our biology.

            That said, we _do_ know how probabilistic models work and we do know that a calculator doesn't think or have awareness in the way we consider ourselves to.

      • MattPalmer1086 2 hours ago

        You can call it reasoning if you like, but what is going on underneath is clearly not the same process that humans use. Both can produce a useful answer, to be sure.

        It's also not a great example of reasoning, as it is so simple. It's matching the context to an answer in some high-dimensional space. But there is no real deduction or inference going on, where new ideas are derived from the proposition.

    • ben_w 3 hours ago

      Depends on which definition of "reasoning" you prefer.

      If you need reasoning to be a "conscious" process, then we've got another problem because there's about 40* different meanings of that, too.

      * Wikipedia cites: Vimal RL, Sansthana DA (2010). "On the Quest of Defining Consciousness" (PDF). Mind and Matter. 8 (1): 93–121

      but then gives a dead link, so I couldn't check it.

    • ziofill 3 hours ago

      I agree. Reasoning should be robust to perturbations of the way in which the same problem is cast, but LLMs have problems with that.

      • krisoft 2 hours ago

        Does human reasoning survive that test? My reasoning, for example, is robust to some extent, but you can easily confuse me with appropriately (or inappropriately) worded questions.

        That's not even talking about how, if you grab me by the ankle and swing me around at 10g acceleration, I won’t even be able to answer the simplest questions. So clearly my reasoning is highly sensitive to certain perturbations.

        • d4mi3n 2 hours ago

          Not disagreeing with you, but a point to consider for reasoning is that we consider it a key cornerstone of sapience and sentience.

          I think one could argue that an LLM could be considered sapient (able to solve problems) in a general sense, but probably not sentient (able to sustain undifferentiated consciousness, continue to absorb and apply information, etc.).

          Part of the difficulty of these conversations, though, is that many of these definitions were made before our math was advanced enough to approximate any of these criteria. We humans have also been notorious for excluding other beings that could be considered sapient (e.g. elephants, dolphins, corvids).

          In all cases, I think this is going to be a difficult topic to find consensus on until we better understand how our biology results in the emergent properties of our own sapience.