27 comments

  • thrance 3 hours ago

    By "Mathematical Reasoning Capabilities" they meant "basic arithmetic" and by "Numerical Precision" they meant "quantization".

    • Vetch 34 minutes ago

      That's not quite right. By numerical precision they mean numerical precision, of which quantization is one method of arriving at reduced precision. They also perform experiments where they train from scratch in float32 and float16.

      They should have replaced "reasoning" with: iterative computations with accumulating state. This paper on the impact of quantization is actually a lot more significant than it appears at first, and I think the authors could have done a better job of discussing the broader implications.

      The paper's core (and unsurprising) argument is that low-precision arithmetic significantly limits the representational capacity of individual neurons. This forces the model to encode numerical values across multiple neurons to avoid overflow, particularly when storing intermediate computational results. This distributed representation, in turn, increases complexity and makes the model vulnerable to accumulating errors, particularly during iterative operations. It's unclear from my initial reading whether low-precision training (quantization-aware training) is necessary for the model to effectively learn and utilize these distributed representations, or if this capacity is inherent. Regardless, while QAT likely offers benefits, especially with larger numbers, the fundamental limitations of low precision computation persist.
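
      To make the accumulating-error point concrete, here is a minimal numpy sketch (my own illustration, not from the paper): repeatedly adding a small increment in float16 stalls once the running total is large enough that the increment falls below half the spacing between representable values, while float32 keeps tracking it.

        import numpy as np

        def accumulate(dtype, n=10_000, step=0.1):
            # Repeatedly add `step` to a running total kept in `dtype`.
            total = dtype(0)
            for _ in range(n):
                total = dtype(total + dtype(step))
            return total

        print(accumulate(np.float32))  # ~1000, with small rounding error
        print(accumulate(np.float16))  # 256.0: the sum stalls once adding 0.1
                                       # can no longer change the float16 total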

      Why not just use a calculator? For some of the same reasons humans shouldn't be completely dependent on calculators. It's not just the ability to perform Fermi estimates that's constrained, but also internal computations that require physical "intuition" or modeling the trajectories of physical systems, the ability to work with growing algebraic representations, relative numeric comparisons of large magnitudes (where the model does not internally switch to a favorable format--notice this is easier to do in-context), and representing iterative computation on complex logical chains and hierarchical structures.

      Why do we not see this in practice? I contend that we do. There is a small but quite vocal contingent in every LLM forum who insist that quantization, even to 8 bits, results in severe degradation in quality despite what most benchmarks say. It's quite likely that common tasks and most tests do not require iterative computations where accumulating state representations must be accurately tracked, and these individuals are encountering some of the exceptions.

  • alexvitkov 3 hours ago

    I wonder if we'll get better performance on arithmetic tasks if we let LLMs generate digits backwards, e.g. 12345+11111=<rev>65432</rev> where the rev tags are special tokens.

    The reason being that there's less context needed to do it this way, for addition at every step there's only 3 digits that need to be considered and they're already in the token stream.
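
    A minimal sketch of that least-significant-digit-first scheme in plain Python (digits handled as strings; the special <rev> tokens themselves aren't modeled here):

      def add_reversed(a_rev: str, b_rev: str) -> str:
          # Operands arrive with digits reversed, least significant first,
          # e.g. 12345 -> "54321". At each position only the two current
          # digits and the carry matter.
          out, carry = [], 0
          for i in range(max(len(a_rev), len(b_rev))):
              da = int(a_rev[i]) if i < len(a_rev) else 0
              db = int(b_rev[i]) if i < len(b_rev) else 0
              carry, digit = divmod(da + db + carry, 10)
              out.append(str(digit))
          if carry:
              out.append(str(carry))
          return "".join(out)  # still reversed, like the <rev>...</rev> output

      print(add_reversed("54321", "11111"))  # "65432", i.e. 23456 reversed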

    • Vetch 28 minutes ago

      It does, and the paper mentions some papers in its appendix that investigate similar strategies, but the fundamental precision issues do not go away.

    • thesz 2 hours ago

      Most probably, something like 65432 is just one token.

      • alexvitkov an hour ago

        It's not; it wouldn't make sense to have 100,000 tokens just for the first 100,000 numbers. There's a playground [1] where you can see how LLMs tokenize a string.

        12345678987654321 is tokenized on various models like so:

          GPT4                123-456-789-876-543-21
          GPT3                123-45-678-98-765-43-21
          Llama-2, Mistral    1-2-3-4-5-6-7-8-9-8-7-6-5-4-3-2-1
        
        [1] https://huggingface.co/spaces/Xenova/the-tokenizer-playgroun...
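
        You can also check this locally with a tokenizer library; a small sketch using tiktoken (assuming the cl100k_base encoding, which GPT-4 uses):

          import tiktoken

          enc = tiktoken.get_encoding("cl100k_base")
          tokens = enc.encode("12345678987654321")
          # Decode each token id individually to see the digit chunks
          print([enc.decode([t]) for t in tokens])
          # ['123', '456', '789', '876', '543', '21']
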
  • magicalhippo 2 hours ago

    Just been dabbling with local models, and while the several models I've tried generate decent sentences while quantized, they suffered heavily in following instructions and picking up details.

    So a larger but fairly aggressively quantized model could perform worse than a smaller variant of the model with just light quantization, even though the larger one still used more memory in total.

    I guess some of this is due to the models not being trained for the quantization levels I used. In any case, I say don't get blinded by parameter count alone; compare performance.
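
    For a rough sense of the memory comparison, a back-of-envelope sketch (the model sizes and bit widths are hypothetical, and KV cache, activations and quantization metadata are ignored):

      def weight_gib(params_billion: float, bits_per_weight: float) -> float:
          # Approximate weight memory: parameters * bits per weight / 8 bytes.
          return params_billion * 1e9 * bits_per_weight / 8 / 2**30

      print(f"13B @ 4-bit: {weight_gib(13, 4):.1f} GiB")  # ~6.1 GiB
      print(f" 7B @ 6-bit: {weight_gib(7, 6):.1f} GiB")   # ~4.9 GiB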

  • Animats an hour ago

    The empirical results seem legit. Not sure about all those "theorems". Those are claims of function existence, not a demonstration that transformers actually work that way. Comments?

  • hggigg 3 hours ago

    Isn't reasoning a little strong word? I mean finding evidence of any reasoning is the first point and I haven't seen anything that points to that.

    (note that I mean structured reasoning to solve new problems here, not just the correct answer and steps for existing problems).

    • swatcoder 3 hours ago

      The field of AI likes to borrow pop psychology/philosophy language and use it for its own narrow jargon. It's been doing that for almost a century now. That's how it builds enthusiasm and raises funds, and also how it shoots itself in the foot and invites the AI winters that hollow it out again.

      The field named itself that very way in response to an early 20th century boom of interest (money) in "intelligence" as something quantifiable (and therefore potentially computable). Tons of money was flying around based on the idea that people could be quantitatively measured for work-suitability, and researchers in computation saw an opportunity to get research funds to explore that in their own budding endeavor, so they adopted the language and made a very long bet that they might eventually figure it out.

      At this point, there is largely consensus within the field that current models perform a process that the field has collectively decided to call "reasoning" as part of its tradition of repurposing psychology and philosophy language as jargon.

      There's no putting that cat back in the bag. It's a term of art within the field now. Frustratingly, it's easier to just accept that this is just a novel sense of the word than to try to reconcile it with senses that we might be familiar with from lay usage or from other fields.

      • d4mi3n 2 hours ago

        I find it apropos that this topic (disagreement about whether LLMs can "reason") boils down to one of the hard problems in computer science: naming things.

    • krisoft 2 hours ago

      > finding evidence of any reasoning is the first point

      Happy to oblige. Oxford Languages dictionary defines “reasoning” as “the action of thinking about something in a logical, sensible way.”

      This was my prompt to chatgpt: “Hey chatgpt! On hacker news user hggigg said this: Isn't reasoning a little strong word? I mean finding evidence of any reasoning is the first point and I haven't seen anything that points to that.

      Based on this comment what do you think is the likely attitude of hggigg about large language models? (Answer with a single sentence)”

      This was the response: “hggigg likely expresses skepticism about large language models' ability to genuinely reason.”

      That sounds like reasoning to me. It was an action of thinking about something in a logical, sensible way. The something (a snippet of your comment) was novel. I don’t know what you mean by “structured”. Does it provide evidence of reasoning for you? (If not, what would?)
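
      For anyone who wants to rerun this, a minimal sketch with the OpenAI Python client (I used the regular ChatGPT interface, so the model name below is only illustrative):

        from openai import OpenAI

        client = OpenAI()  # expects OPENAI_API_KEY in the environment
        prompt = (
            "Hey chatgpt! On hacker news user hggigg said this: Isn't reasoning a "
            "little strong word? I mean finding evidence of any reasoning is the "
            "first point and I haven't seen anything that points to that.\n\n"
            "Based on this comment what do you think is the likely attitude of "
            "hggigg about large language models? (Answer with a single sentence)"
        )
        response = client.chat.completions.create(
            model="gpt-4o",  # illustrative; any recent chat model
            messages=[{"role": "user", "content": prompt}],
        )
        print(response.choices[0].message.content)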

      • nyrikki an hour ago

        Modern English lexicography is descriptive and not prescriptive. The OED is simply insufficient to show much more than a particular meaning was common among some cohort.

        ELIZA transcripts also used to pass as human, even with experts in some contexts, but it was clearly mimicry.

        It is almost impossible to have real discussions on this topic without deciding on definitions beforehand.

        That is especially true given the impressive results of modern LLMs, but they are still essentially pattern finding and matching.

        If that meets your personal definition of 'reasoning', that is fine.

        It is probably advisable to understand the limitations if you are going to monetize it.

        Even stochastic parrots may have practical applications, but mimicry doesn't prove understanding.

        One challenge is that modern LLMs are so huge that it is fairly difficult to find anything outside the corpus.

        Look at OpenAI's GPT-4 technical report and you will see that almost all of its performance on school tests is from pre-training.

        That doesn't diminish the potential value, but it does point to the limits. If you have access to things that are reliably not in the corpus, or that are likely to be lost with set shattering, it is fairly easy to show that it is still NLP without a real understanding of the tokens.

        But that is my personal definition of 'reasoning', which obviously isn't the same as yours.

      • swatcoder 2 hours ago

        > It was an action of thinking about something in a logical, sensible way

        If only it was so simple! Empirically, you streamed a short text into a black box system and received a short text out. The output text was related and coherent, and -- after the fact -- you assessed it to be logically, sensibly applicable given the input.

        On your way to using that to exemplify "reasoning" you took for granted that there was "thinking", and that the "thinking" was both "logical" and "sensible".

        Any number of systems might deliver the empirical result you measured, and many of them would involve none of that. LLMs are sophisticated and highly capable, but the discussion about what language to use for what they do isn't nearly so simple as you suggest.

        • og_kalu an hour ago

          >Any number of systems might deliver the empirical result you measured

          Like what? What exactly would give similar empirical results over the wide range of tests LLMs have been subjected to?

          >and many of them would involve none of that.

          Oh? How would you know that?

          Did I miss the breakthrough Intellisense-o-meter that can measure the intelligence flowing through a system?

          I guess I'm just not sure what point you're trying to make here. Empirically assessing the result after the fact is how we determine intelligence and reasoning in anything, humans included.

          How do you determine a piece of metal is real gold? By comparing the results of a series of tests against the properties of gold you have outlined. The metal is a black box. You don't need to understand every interaction occurring in your tests to determine whether it is gold or not.

          • swatcoder 27 minutes ago

            > >Any number of systems might deliver the empirical result you measured

            > Like what? What exactly would give similar empirical results over the wide range of tests...

            The GP gave a single trivial example and my reply was highlighting how that example didn't inform the question at hand in any way. That example might just as well be exhibited by a lookup table, a lucky random output, a modified Markov chain, a Chinese Room, a trivial small language model trained on the right corpus, etc.

            That certain LLMs or chatbots might also be able to deliver on more sophisticated examples that those systems cannot is not what they or I were talking about here. It was a discussion about obviousness from trivial samples, and about the deeper semantic dependencies hidden in the GP's definition. Such trivial samples don't hold up to scrutiny as their definition recurses down through concepts that are not at all demonstrated in these trivial cases. Their attempt at rhetoric just fails.

            > Empirically assessing the result after the fact is how we determine intelligence and reasoning in.. anything humans included.

            No. We project intelligence onto humans inductively by identifying our own experience as intelligence and assuming that other things that are almost identical to us are very likely to be working in the same way.

            In recent decades, there's been growing acceptance that other systems approximately like us (animals of certain kinds) exhibit it to various degrees as well, but even this was a formally rejected concept during the early 20th century, when "intelligence" first became treated as something measurable and quantifiable at all. The concept of intelligence is a wholly cultural one, and its definition and applicability move around over time.

            Yet there are currently very few people who would apply the word to a lookup table, a random output, a Chinese Room, a modified Markov chain, etc., and a small but growing number of people who are comfortable applying it to LLMs and chatbots as we see them today. As this happens and its use expands, the sense of the word changes, as it has been doing for hundreds of years.

            At this point, we mostly can't yet rely on chatbots to fulfill roles that require interesting human intelligence. They can pass some benchmark tests to greater or lesser degrees, but they also happen to be excellent benchmark optimizers (that's the essence of their design), so it remains hard to know what that really means.

            If and when they become reliable substitutes for human intelligence in general tasks, or become interestingly independent in the way we see certain animals, the prior-yet-modern senses of intelligence will be easier to apply. But we're not there yet.

        • d4mi3n 2 hours ago

          I agree with this sentiment and would remind everyone that LLMs are probabilistic models. Anything that isn't in the training data set will not be produced as an output. If you squint really hard, you could kinda say an LLM is just a fancy compression/decompression scheme.

          That said, in addition to anthropomorphizing something that sounds like a person, I think a fascinating thing about these discussions is the variety of unstated opinions on what reasoning, thought, or awareness is. This whole topic would be a lot simpler to classify if we had a better understanding of how _we_ are able to reason, think, and be aware to the level of understanding we have of the math behind an LLM.

          • mannykannot an hour ago

            > Anything that isn't in the training data set will not be produced as an output.

            Thinking back to some of the examples I have seen, it feels as though this is not so. In particular, I'm thinking of a series where ChatGPT was prompted to invent a toy language with a simple grammar and then translate sentences between that language and English [1]. It seems implausible that the outputs produced here were in the training data set.

            Furthermore, given that LLMs are probabilistic models and produce their output stochastically, it does not seem surprising that they might produce output not in their training data.

            I agree that we do not have a good understanding of what reasoning, thought, or awareness is. One of the questions I have been pondering lately is whether, when a possible solution or answer to some question pops into my mind, it is the result of an unconscious LLM-like process, but that can't be all of it; for one thing, I - unlike LLMs - have a limited ability to assess my ideas for plausibility without being prompted to do so.

            [1] https://maximumeffort.substack.com/p/i-taught-chatgpt-to-inv...

          • og_kalu an hour ago

            >that LLMs are probabilistic models.

            And the brain isn't? How do you think you have such a seamlessly continuous view of reality? The brain is very probabilistic.

            >Anything that isn't in the training data set will not be produced as an output.

            This is not a restriction of probabilistic models. And it's certainly not a restriction of SOTA LLMs you can test today.

            >be aware to the level of understanding we have of the math behind an LLM.

            We have very little understanding of the meaning of the computations in large ANNs.

        • wbl 2 hours ago

          And how do I know you think?

          • swatcoder 2 hours ago

            Are you sure you do? You're just responding to a block of text on an internet forum with a trivial signup flow, known to be populated by bots of greater or lesser sophistication, as everywhere else on the internet.

          • d4mi3n 2 hours ago

            We don't! We've gotten into philosophy, which is always a blast to ponder over. Aside from measuring brain activity and making observations, all we really know is that awareness seems to be an emergent property of our biology.

            That said, we _do_ know how probabilistic models work and we do know that a calculator doesn't think or have awareness in the way we consider ourselves to.

      • MattPalmer1086 2 hours ago

        You can call it reasoning if you like, but what is going on underneath is clearly not the same process that humans use. Both can produce a useful answer, to be sure.

        It's also not a great example of reasoning, as it is so simple. It's matching the context to an answer in some high-dimensional space. But there is no real deduction or inference going on, where new ideas are derived from the proposition.

    • ben_w 3 hours ago

      Depends on which definition of "reasoning" you prefer.

      If you need reasoning to be a "conscious" process, then we've got another problem because there's about 40* different meanings of that, too.

      * Wikipedia cites: Vimal RL, Sansthana DA (2010). "On the Quest of Defining Consciousness" (PDF). Mind and Matter. 8 (1): 93–121

      but then gives a dead link, so I couldn't check it.

    • ziofill 3 hours ago

      I agree. Reasoning should be robust to perturbations of the way in which the same problem is cast, but LLMs have problems with that.

      • krisoft 2 hours ago

        Does human reasoning survive that test? My reasoning, for example, is robust to some extent, but you can easily confuse me with appropriately (or inappropriately) worded questions.

        That's not even talking about how, if you grab me by the ankle and swing me around at 10g acceleration, I won’t even be able to answer the simplest questions. So clearly my reasoning is highly sensitive to certain perturbations.

        • d4mi3n 2 hours ago

          Not disagreeing with you, but a point to consider for reasoning is that we consider it a key cornerstone of sapience and sentience.

          I think one could argue that an LLM could be considered sapient (able to solve problems) in a general sense, but probably not sentient (able to sustain undifferentiated consciousness, continue to absorb and apply information, etc.).

          Part of the difficulty of these conversations, though, is that many of these definitions were made before our math was advanced enough to approximate any of these criteria. We humans have also been notorious for excluding other beings that could be considered sapient (e.g. elephants, dolphins, corvids).

          In all cases, I think this is going to be a difficult topic to find consensus on until we better understand how our biology results in the emergent properties of our own sapience.