Numbers aren't language, or even sequences of tokens, or vectors.
There is an inherent numeric-ness and logic to math that I don't think we can represent well using LLMs and transformers.
3 isn't about the word "three" - it is a quantity or a measurement. And 3x4 is a specific numerical operation that is not really contained in that sequence of symbols.
A while back I saw a post where people ran a model over and over to accomplish a code base port from one language to another.
In their prompt, they told it to leave itself a note and to accomplish something each time.
Then they put the model in a loop and it worked. In one instance, a model removed itself from the loop by editing a file or some other basic means.
To me, iterative tasks like like multiply and long divide, look an awful lot like the code port experiment.
Putting models into loops so they get more than one bite at the task seems to be a logical progression to improve capability.
The number of paths in the wrong direction is infinitely greater than the number in the right direction. You'll quickly realize this doesn't actually scale.
Would love to see an architecture that learned more like humans: start with just imitating one letter, then a few more, then some syllables, then full words, then sentences, etc., progressively adding on top of previous knowledge.
Also, it’s interesting that one of the big goals/measures of models is their capacity to “generalize”, but the training methods optimize for loss/accuracy, and only after training do we test for generalization to validate.
Are there training methods/curriculums that explicitly maximize generalization?
Yes, I also wonder about this! Progress from children's books to scientific papers, etc. Could it learn e.g. language structure faster in a pre-training stage? Also, somehow one needs to define a proxy for generalization to compute a loss and do backpropagation.
This field of study is known as "Curriculum Learning" for your Googling pleasure (or I guess ChatGPT Deep Research now).
"an architecture that learned more like humans"
i.e. enduring countless generations of evolutionary selection and cross-breeding, then fine-tuning a bit?
Although it could be interesting, I don't think training on progressively complex strings entirely recapitulates this.
That’s a very interesting take. I hadn’t really considered evolution
I guess if you really wanted to start from scratch, you could figure out how to evolve the whole system from a single cell or something like that. In some ways neural networks have kind of evolved in that way, assisted by humans. They started with a single perceptron, and have gone all the way to deep learning and convolutional networks
I also remember a long time ago studying genetic and evolutionary algorithms, but they were pretty basic in terms of what they could learn and do, compared to modern LLMs
Although recently I saw some research in which they were applying essentially genetic algorithms to merge model weights and produce models with new/evolved capabilities
Given their names I'd say they're too busy optimising primes...
Take your damned upvote, and go away.
The chains-of-thought here are artificially constructed, very information-dense partial sums formatted in a specific way that guides the fine tuning. A potential next step would be to look at real-world chains-of-thought and see whether some process could start with those and achieve the same result. Then you could really have a self-improving system!
Also I wonder if the LLM "knows" that it has this capability after fine-tuning. If it encounters multiplication as part of some larger chain-of-thought, will it solve that internally, or will it continue to do it step-by-step in the chain-of-thought?
But it's very hard to define "real-world CoT" -- think about humans: we learn multiplication by vertical (long-form) calculation, and we learn division in a similar way -- all of these learning processes require "information-dense" tools (the written calculation procedure) with intrinsic math rules built in. Isn't that an adapted form of CoT?
They're not any better at addition, are they? If they are, I wonder how good they are at adding numbers in log space.
The paper uses a number representation that is designed to make attention easy to learn: each digit is a separate token and the least significant digit is put first, so that the first digit of the output is simply the sum of the first digits of the inputs, the second digit is the sum of the second digits plus an optional carry from the first digits, and so on.
If the numbers are represented with the most significant digit first as usual, you need a bunch of intermediate steps before outputting even the first digit just to determine whether it is affected by a carry or not.
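To make that concrete, here is a small sketch (my own illustration, not code from the paper) of why addition becomes a purely left-to-right, digit-at-a-time process once numbers are written least-significant-digit first:

```python
# Least-significant-digit-first addition (illustration only, not the
# paper's code). Each digit is one "token"; the i-th output digit
# depends only on the i-th input digits plus a single carry bit.
def add_lsd_first(a_digits, b_digits):
    """a_digits, b_digits: lists of digits, least significant first."""
    out, carry = [], 0
    for i in range(max(len(a_digits), len(b_digits))):
        da = a_digits[i] if i < len(a_digits) else 0
        db = b_digits[i] if i < len(b_digits) else 0
        s = da + db + carry
        out.append(s % 10)   # this output token can be emitted immediately
        carry = s // 10      # the only state carried forward
    if carry:
        out.append(carry)
    return out

# 473 + 958 = 1431, written in reverse: [3,7,4] + [8,5,9] -> [1,3,4,1]
assert add_lsd_first([3, 7, 4], [8, 5, 9]) == [1, 3, 4, 1]
```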
The paper looks at multiplication of numbers represented with the least significant digit first as a toy task requiring several additions as intermediate steps to study why a model large enough to perform those additions in principle fails to learn to do so in practice.
They compare with a model that is first trained to produce the intermediate additions explicitly (as a "chain of thought" with a specific format) and then has this CoT progressively shortened during training until there's nothing left of it. But that second model successfully multiplies.
The difference appears to be that the presence of the intermediate results induces a better number representation in latent space, whereas the model without CoT gets stuck in a less efficient local minimum.
So the answer to the question "Why can't transformers learn multiplication?" is that the training process is insufficient for the model to discover the best intermediate steps on its own.
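A rough sketch of what that training setup might look like (my own reconstruction of the idea, not the paper's actual token format): generate reversed-digit multiplication examples with the running partial sums spelled out, then keep fewer and fewer of those scratchpad steps as training proceeds.

```python
# Hedged sketch of "chain-of-thought that gets progressively shortened".
# The paper's exact format differs; this only shows the shape of the data.
def make_example(a, b, keep_steps):
    """One training string: reversed-digit operands, up to `keep_steps`
    intermediate partial sums, then the reversed-digit answer."""
    rev = lambda n: " ".join(str(n)[::-1])
    steps, partial = [], 0
    for i, d in enumerate(str(b)[::-1]):          # one row per digit of b
        partial += a * int(d) * 10**i
        steps.append(f"p{i}= {rev(partial)}")     # running partial sum
    cot = " ; ".join(steps[:keep_steps])          # progressively shortened
    return f"{rev(a)} * {rev(b)} : {cot} => {rev(a * b)}"

print(make_example(1234, 5678, keep_steps=4))  # full scratchpad
print(make_example(1234, 5678, keep_steps=0))  # scratchpad fully removed
```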
You could do a similar experiment where the CoT involves first taking the logarithm, adding, and then exponentiating to get the final result, but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
> but I think logarithms are probably another computation that's too difficult to learn without additional hints for intermediate steps.
I suppose you're probably right, but LLMs probably have a lot of log tables in their training data so I'm not so sure.
The paper is about the ability of transformers to learn a task based on training data for that task only, not about LLMs pretrained on much of the internet. And training on log tables doesn't necessarily allow the model to always output the correct logarithm, just as training on multiplication tables doesn't necessarily confer the ability to multiply.
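For concreteness, here is what the log-space route looks like numerically (plain Python floats; the precision issue is part of why exact integer products are hard to recover this way):

```python
import math

# Multiply via logarithms: a*b = exp(log a + log b).
# Approximately right, but float rounding means the exact integer
# product is not reliably recoverable once the operands get large.
a, b = 4729, 8863
approx = math.exp(math.log(a) + math.log(b))
print(approx)                   # close to 41913127
print(a * b)                    # 41913127 exactly
print(round(approx) == a * b)   # may hold here, but fails as numbers grow
```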
I asked a model to tell me what the "long multiplication algorithm" is. It gave it to me. I then asked it to follow that algorithm to solve e.g. 12987318927 * 12098102983, and it followed the algorithm and got the right answer. It DOES fail more when the numbers are longer (because it results in more text in the context), but that can be improved by having the model focus on the right subset of the text, right?
> It DOES fail more when the numbers are longer (because it results in more text in the context),
I tried to raise this question yesterday. https://news.ycombinator.com/item?id=45683113#45687769
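For reference, the long multiplication procedure being followed in the comment above, written out in plain Python (one partial-product row per digit of the multiplier, shifted and summed):

```python
# Schoolbook long multiplication, spelled out step by step.
def long_multiply(x, y):
    xs = [int(d) for d in str(x)][::-1]        # digits of x, least significant first
    total = 0
    for i, d in enumerate(int(c) for c in str(y)[::-1]):
        carry, row = 0, []
        for dx in xs:                          # one partial-product row, digit by digit
            p = dx * d + carry
            row.append(p % 10)
            carry = p // 10
        row.append(carry)
        partial = sum(dig * 10**j for j, dig in enumerate(row))
        total += partial * 10**i               # shift by the multiplier digit's place
    return total

assert long_multiply(12987318927, 12098102983) == 12987318927 * 12098102983
```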
Declaring victory on "reasoning" based on cherry-picking a correct result about arithmetic is, of course, very narrow and absurdly optimistic. Even if it correctly works for all NxM calculations. Moving on from arithmetic to any kind of problem that fundamentally reduces to model-checking behind the scenes.. we would be talking about exploring a state-space with potentially many thousands of state-transitions for simple stuff. If each one even has a small chance of crapping out due to hallucination, the chance of encountering errors at the macro-scale is going to be practically guaranteed.
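To put a rough number on that compounding (a back-of-the-envelope independence assumption, nothing more): even a 0.1% per-step error rate makes long chains almost certain to fail somewhere.

```python
# Probability that n independent steps all succeed, given per-step
# success probability p. Crude, but it shows how fast things decay.
p = 0.999                       # 99.9% chance each step is correct
for n in (100, 1_000, 10_000):  # number of state transitions
    print(n, p ** n)
# 100    -> ~0.905
# 1000   -> ~0.368
# 10000  -> ~4.5e-05
```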
Everyone will say, "but you want tool-use or code-gen for this anyway". Sure! But carry-digits or similar is just one version of "correct matters" and putting some non-local kinds of demands on attention, plus it's easier to check than code. So tool-use or code-gen is just pushing the same problem somewhere else to hide it.. there's still a lot of steps involved, and each one really has to be correct if the macro-layer is going to be correct and the whole thing is going to be hands-off / actually automated. Maybe that's why local-models can still barely handle nontrivial tool-calling.
Well, if the model can reliably keep in context CPU cache plus CPU registers plus CPU instructions and is able to do operations based on those, then we pretty much solved computation using LLMs, right? It could use RAG to operate on RAM and SSD.
Here we can see roughly how much data a high-end traditional non-SoC CPU holds:
> For a recent high-end non-SoC desktop CPU:
> Cache: ~40-100 MB total (L1 + L2 + shared L3)
> Register files: tens to a few hundred KB total across cores (e.g., ~200-300 KB or so)
> Combined: so you're looking at ~40-100 MB + ~0.2 MB → roughly ~40-100 MB of total on-chip caches + registers.
I'm sure we can reduce these caches to fit in the context windows of today's LLMs (~500,000 tokens).
Then, with temperature 0 we get more "discrete" operations. Now, we still have the rare problem of hallucinations, but it should be small with temperature 0.
It doesn't work like mapping CPU caches/registers into an LLM context. Transformers have no mutable registers; they attend over past tokens and can't update prior state. RAG isn't RAM. Even with a huge context, you still can't step through CPU-style instructions without external read/write memory or tooling.
And temperature 0 makes outputs deterministic, not magically correct.
> And temperature 0 makes outputs deterministic, not magically correct.
For reasons I don't claim to really understand, I don't think it even makes them deterministic. Floating point something something? I'm not sure temperature even has a static technical definition or implementation everywhere at this point. I've been ignoring temperature and using nucleus sampling anywhere that's exposed and it seems to work better.
Random but typical example.. pydantic-ai has a caveat that doesn't reference any particular model: "Note that even with temperature of 0.0, the results will not be fully deterministic". And of course this is just the very bottom layer of model-config and in a system of diverse agents using different frameworks and models, it's even worse.
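As far as I understand it, the usual culprit is floating-point non-associativity: parallel reductions on the accelerator can sum the same values in a different order from run to run, so even greedy (temperature 0) decoding can occasionally flip a token. A toy illustration of the order-dependence:

```python
# Floating-point addition is not associative, so the result of a large
# reduction depends on the order the hardware happens to use.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)   # 1.0
print(a + (b + c))   # 0.0
```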
Well, the LLM may re-infer the whole state fully on every instruction. Temperature 0 is deterministic and that's what we are looking for. If the model is trained properly on how the CPU state + instructions should be handled, then it should be able to produce the next state.
Because they produce output probabilistically, when multiplication is deterministic. Why is this so hard for everyone?
If being probabilistic prevented learning deterministic functions, transformers couldn’t learn addition either. But they can, so that can't be the reason.
People are probabilistic, and I've been informed that people are able to perform multiplication.
Yes, and unlike the LLM they can iterate on a problem.
When I multiply, I take it in chunks.
Put the LLM into a loop, instruct it to keep track of where it is and have it solve a digit at a time.
I bet it does just fine. See my other comment as to why I think that is.
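A rough sketch of that kind of driver loop, with the model call stubbed out (`ask_model` here is purely hypothetical; the point is only that the harness, not the model, carries the running state):

```python
# Hypothetical outer loop: the harness owns the state (digit index,
# accumulated total) and asks the model for one small step at a time
# instead of one giant end-to-end answer.
def ask_model(prompt: str) -> str:
    raise NotImplementedError("stand-in for whatever LLM API you use")

def multiply_in_steps(a: int, b: int) -> int:
    digits = [int(d) for d in str(b)][::-1]   # multiplier digits, least significant first
    total = 0
    for i, d in enumerate(digits):
        prompt = (f"Multiply {a} by the single digit {d} "
                  f"and reply with just the number.")
        partial = int(ask_model(prompt))      # one small, checkable step
        total += partial * 10**i
    return total
```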
Not true though. Internally they can “shell out” to sub-tasks that know how to do specific things. The specific things don’t have to be models.
(I’m specifically talking about commercial hosted ones that have the capability i describe - obviously your run of the mill one downloaded off of the internet cannot do this).
Yes, what you're describing is not a transformer but a high-level LLM-based product with tool calling wired up to it.
That doesn't appear to be the kind of thing this article is describing.
This is a gut impression and I don't deny it, but LLMs are Large Language Models, and in my own brain, my Language Model isn't doing large-scale multiplication. I have a language-based intuition for the single-digit multiplication table and a touch beyond (and based on my observations that's already above average for a human Language Model, at least in my age peer group), but it's not my Language Model doing 283 times 9284. That requires a symbolic manipulation model, and in fact I would observe that my personal neural net, for all the things it is amazingly good at, is in fact quite terrible at that sort of multiplication too. A Commodore PET is by all measures vastly, vastly simpler than my brain, but it blows away my multiplication capabilities. And then the symbolic systems tacked on another, what, 15 orders of magnitude from that "blows away my multiplication capabilities"? Depends on how you count, but something like that.
You can sit here and force me to recite ("train me on") multi-digit multiplication problems and their result until the day I die, and my language model is only going to get marginally better. It is in practicing my symbolic manipulation that I'm going to get better and faster.
It seems to me that expecting a Language Model to be very good at multiplication is asking for a substantially superhuman level of performance from them, and one that we have little reason to believe will scale anyhow. What we need is symbolic manipulation, better than the approximation they achieve when "reasoning".
I find it rather ironic to sit here and use the aforementioned 15 orders of magnitude improvement over the Commodore PET to use that level of symbolic manipulation firepower to laboriously recreate a software system that is as bad as we are at multiplication for what may well be the same fundamental reasons... and then have the audacity to complain about it. My metaphorical dude, you did a couple trillion multiplications just to get to this single bad multiplication output... maybe another approach is called for.
I agree with you; it seems like we are trying to make the shoe fit. Not only are we missing an understanding of what is happening inside transformers, but now we are trying to teach them, see how they respond, and then interpret it. That seems fine with viruses and animals, but we are talking about a piece of software here. Shouldn't we know what's happening inside? Maybe these kinds of papers can shine more light and give us better understanding, though; still, it feels backwards to me...

Regarding the multiplication itself, shouldn't a pure understanding of the meaning of multiplication (it's basically a summation) be enough for 'AI' to call it a day? If an AI or a human understands that, then the rest is the computation part. We've already got that covered, so instead of having 'AI' learn it on its own from a crazy amount of data and get it right 99% of the time, shouldn't we just give it a calculator? Somebody PLEEAASE give this AI a calculator :-)
A lot of savants that are able to do really cool calculations, or even people that have synesthesia seeing numbers as colors, don't actually do "real" calculations.
I think most humans that do math aren't actually literally computing things as some kind of logic machine.
We can produce logic, and follow the steps of using that logic, but it doesn't seem to me that our cognition is some kind of logic machine itself.
True. Generally it seems like you're visualizing things, moving stuff around, seeing vague patterns and trying to make them more clear. IDK how a transformer architecture would fit all of that in its context, or use it productively once it's there. You can't just keep appending forever, but you also can't delete stuff either, because unlike for humans, a deletion is a hard delete; there's no fuzzy remembrance left to rely on, so even deleting bad ideas is dangerous because it'll forget that it was a bad idea and infinite-loop. Symbol manipulation doesn't come until the end, after you have a good idea what that part will look like.
Hmm, I wonder what happens if you let them manipulate their own context symbolically, maybe something like a stack machine. Perhaps all you need is a "delete" token, or a "replace" flag. That way you don't have context full of irrelevant information.
I guess the challenge is, where would the training data come from? Data on the internet is in its final form so "next token" is never a delete.
Edit: I guess in essence, that's what reasoning LLMs already do. IIUC the thought blocks are ephemeral, and only the response is maintained for the chat. Maybe there'd be some benefit of doing this recursively? But that's also kind of what subagents are for. So, perhaps nothing new here.
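If anyone wanted to play with that idea, the context-editing mechanics themselves are trivial; the hard part, as noted above, is where the training data for emitting the control tokens would come from. A toy sketch (my own invention, not from any paper) of applying "delete"/"replace" tokens to a running context:

```python
# Toy "stack machine" over the context: ordinary tokens are appended,
# a <del> token pops the previous entry, <rep> swaps it for the next one.
def apply_edits(stream):
    ctx, it = [], iter(stream)
    for tok in it:
        if tok == "<del>":
            if ctx:
                ctx.pop()                    # hard-delete the last entry
        elif tok == "<rep>":
            if ctx:
                ctx[-1] = next(it, ctx[-1])  # replace it with the next token
        else:
            ctx.append(tok)
    return ctx

print(apply_edits(["try", "idea-A", "<del>", "idea-B", "<rep>", "idea-C"]))
# ['try', 'idea-C']
```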
Language _is_ the symbolic manipulation system par excellence though.
There's equivocation in that statement, though, whether you meant there to be or not. There is clearly a difference between how we manipulate English words for normal human activities and the symbolic manipulation with very strict rules that we today associate with mathematics and computer science. Human language goes back thousands of years, into an indefinite past we can't trace. Symbolic manipulation is a much, much more recent development, starting only ~2300 years ago around Euclid and not really coming into full development until much later... you can argue about exactly when, but I'd personally put it as late as the 19th century for it to be recognized in the modern sense. It must be something different if the two are separated by that many centuries.
To disprove my point, please generate a list of 5 random 5-digit numbers and demonstrate multiplying them in your head as quickly as you can read them. Since you can't, clearly there is something about that that is hard for you, despite the fact that the act of reading this text, maintaining physical homeostasis while you do it, and all the other things your brain is doing as you do this represents a staggering amount of raw computation that is vastly, vastly in excess of what is nominally needed to achieve that computation.
Doing multiplication in your head isn't the point though, you can externalise language and use it to do things you can't do in your head by writing it down.
Mathematics was born out of very careful reasoning that we do through language, we only use formalisms as they allow us to avoid the massive ambiguities that exist in natural language. Formal symbolic manipulation came out of our already existing abilities of symbolic manipulation through language.
What probably works: ask it to write a Python program, but tell it not to use the built-in multiplication operator.
Then your transformer would need to know Python.
I think it should be able to learn multiplication with chain of thought. Without it, it's probably really difficult to generalize the multiplication of two n-digit integers when you have to accumulate up to n products of digits and handle carrying for each output digit.
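Spelled out, what the no-CoT model has to do implicitly: output digit k collects every digit product a[i]*b[j] with i+j = k, plus a carry that can itself be several digits long. A quick sketch of that per-output-digit view:

```python
# Multiplication as a digit convolution: column k sums every a[i]*b[j]
# with i + j == k, then splits off the carry for the next column.
def multiply_by_convolution(a: int, b: int) -> int:
    A = [int(d) for d in str(a)][::-1]     # least significant digit first
    B = [int(d) for d in str(b)][::-1]
    out, carry = [], 0
    for k in range(len(A) + len(B) - 1):
        s = carry + sum(A[i] * B[k - i]
                        for i in range(len(A)) if 0 <= k - i < len(B))
        out.append(s % 10)
        carry = s // 10                    # the carry can exceed one digit
    while carry:
        out.append(carry % 10)
        carry //= 10
    return sum(d * 10**j for j, d in enumerate(out))

assert multiply_by_convolution(6789, 4321) == 6789 * 4321
```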
Yesterday, I learned the opposite. Simon Willison demonstrated in another thread how this works out … see https://news.ycombinator.com/item?id=45686295
That's very cool, but it's not an apples to apples comparison. The reasoning model learned how to do long multiplication. (Either from the internet, or from generated examples of long multiplication that were used to sharpen its reasoning skills. In principle, it might have invented it on its own during RL, but no, I don't think so.)
In this paper, the task is to learn how to multiply, strictly from AxB=C examples, with 4-digit numbers. Their vanilla transformer can't learn it, but the one with (their variant of) chain-of-thought can. These are transformers that have never encountered written text, and are too small to understand any of it anyway.
Maybe the AGI will come with the equivalent of a "Turing Machine" enabling some kind of computability.