I think the interface of LLMs with formalized languages is really the future, because there you can formally verify every statement and deal with hallucinations.
It's obviously not the future (outside of mathematics research). The whole LLM boom we've seen in the past two years comes from one single fact: people don't need to learn a new language to use it.
Both comments can be right. People don’t need to know HTML to use the internet.
Anyone else feel like mathematics is sort of the endgame? I.e., once ML can do it better than humans, that’s basically it?
If "better than humans" means when you give it a real world problem, it gives you a mathematical model to describe it (and does it better than human experts), then yes, it's the end game.
If it just solves a few formalized problems with formalized theorems, not so much. You can write a program that solves ALL the problems under formalized theorems already. It just runs very slowly.
I doubt it. Math has the property that you have a way to 100% verify that what you're doing is correct with little cost (as is done with Lean). Most problems don't have anything close to that.
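As a minimal sketch of how cheap that verification is (toy statement, nothing AlphaProof-specific): a Lean file either compiles, in which case the kernel has checked every inference step, or it doesn't.

```lean
-- Toy example: if this file compiles, the proof is machine-verified.
-- The kernel re-checks every step; nothing rests on trusting the author.
theorem toy (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```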
to be fair, humans also have to run experiments to discover whether their models fit nature - AI will do it too.
These kinds of experiments are often orders of magnitude more costly (in time, energy, money, safety, etc.) than verifying a mathematical proof with something like Lean. That's why many think math will be one of the first fields AI cracks: there is a relatively cheap and fast feedback loop available.
More information about the language used in the proofs: https://en.wikipedia.org/wiki/Lean_(proof_assistant)
In the first question, why do they even specify ⌊α⌋ (and ⌊2α⌋ and so on) when α is an integer?
α need not be an integer; we have to prove that it is.
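For reference (paraphrasing the official statement): P1 asks for which real numbers α the sum ⌊α⌋ + ⌊2α⌋ + … + ⌊nα⌋ is a multiple of n for every positive integer n. Since α ranges over the reals, the floors are doing real work; the content of the problem is showing that exactly the even integers qualify.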
Should have read more carefully, thank you!
“Only 5/509 participants solved P6”
This has to come with an asterisk, which is that participants had approximately 90 minutes to work on each problem while AlphaProof computed for three days for each of the ones it solved. Looking at this problem specifically, I think that many participants could have solved P6 without the time limit.
(I think you should be very skeptical of anyone who hypes AlphaProof without mentioning this - which is not to suggest that there's nothing there to hype)
Certainly it's interesting that AlphaProof needed three days. But does it matter for evaluating the importance of this result? No.
I agree that the result is important regardless. But the tradeoff of computing time/cost with problem complexity is hugely important to think about. Finding a proof in a formal language is trivially solvable in theory since you just have to search through possible proofs until you find one ending with the desired statement. The whole practical question is how much time it takes.
Three days per problem is, by many standards, a 'reasonable' amount of time. However, there are still unanswered questions; notably, 'three days' is not really meaningful in and of itself. How parallelized was the computation? What was the hardware capacity? And how optimized is AlphaProof for IMO-type problems (problems which, among other things, all have short solutions using elementary tools)? These are standard kinds of critical questions to ask.
Though, if you start solving problems that humans can't or haven't solved, then questions of capacity won't matter much. A speedup in the movement of the maths frontier would be worth many power stations.
For some time a 'superhuman math AI' could be useful for company advertising and getting the attention of VCs. But eventually it would be pretty clear that innovative math research, with vanishingly few exceptions, isn't very useful for making revenue. (I am a mathematician and this is meant with nothing but respect for math research.)
"Making revenue" is far from being the only metric by which we deem something worthy.
As a mathematician, of course I agree. But in a sentence like:
> A speedup in the movement of the maths frontier would be worth many power stations
who is it 'worth' it to? And to what end? I can say with some confidence that many (likely most, albeit certainly not all) mathematicians do not want data centers and power stations to guzzle energy and do their math for them. It's largely a vision imposed from without by Silicon Valley and Google research teams. What do they want it for and why is it (at least for now) "worth" it to them?
Personally, I don't believe for a second that they want it for the good of the mathematical community. Of course, a few of their individual researchers might have their own personal and altruistic motivations; however I don't think this is so relevant.
More or less. Modern theorem provers, even fully automatic ones, can prove incredibly difficult problems if given enough time. With 3 days and terabytes of memory, perhaps they could? It would be interesting to compare AlphaProof with a standard theorem prover given similarly astronomical computing resources.
That is an interesting thought. But I doubt that standard automated theorem proving techniques would have solved this in 100 years, even with immense resources, as all they do is (basically) brute-force search, and it remains an exponential search space. Therefore 3 days does not really matter; indeed, 3 months would still be a result that 10 years ago I would have thought impossible.
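To make "exponential search space" concrete (rough, purely illustrative numbers): if each proof state offers around 10 applicable inference steps and a solution needs around 50 steps, blind search faces on the order of 10^50 candidates; at even 10^12 candidates per second, that is around 10^30 years. Against numbers like that, the difference between 3 days and 3 months is noise; what matters is how aggressively the learned policy prunes the search.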
If you were to bet on solving problems like "P versus NP" using these technologies combined with human augmentation (or vice versa), what would be the probable time horizon for achieving such a solution? I think we should assume that the solution is also expressible in the current language of math/logic.
The hard part is the creation of new math to solve these problems, not the use of existing mathematics. New objects (groups, rings, fields, etc.) have to be theorized, their properties understood, and then that new machinery used to crack the existing problems. I think we will get to a place (in around 5 years) where AI will be able to solve these problems and create these new objects. I don't think the obstacle is technological; I think it's financial. Meaning, there isn't much money to be made doing this (try to justify it for yourself), hence the lack of focus here. I think this is a red herring and there is a gold mine in there somewhere, but it will likely take someone with a lot of cash funding it out of passion (Vlad Tenev / Harmonic, or Zuck and Meta AI, or the Google / AlphaProof guys); in the big-tech world, though, they are just a minnow project in a sea of competing initiatives. And so that leaves us at the mercy of open research, which, if this is a compute-bound problem, may take 10-20 years to crack. I hope I see a solution to RH in my lifetime (and in language that I can understand).
I understand that a group of motivated individuals, even without significant financial resources, could attempt to tackle these challenges, much like the way free and open-source software (FOSS) is developed. The key ingredients would be motivation and intelligence, as well as a shared passion for advancing mathematics and solving foundational problems.
Ok, but how do you get around needing a 10k or 100k H100 cluster?
It is well known that cloud services like Google Cloud subsidize some projects, and we don't know what efficiency improvements may arrive in a few years.
Probably a bad example: P vs NP is the most likely of the Millennium Prize problems to be unsolvable, so the answer may be "never".
I'll bet the most technical open problems will be the ones to fall first. What AIs lack in creativity they make up for in ability to absorb a large quantity of technical concepts.
Thank you for the response. I have a follow-up question: Could these AIs contribute to advancements in resolving the P vs NP problem? I recall that the solution to Fermat’s Last Theorem relied on significant progress in elliptic curves. Could we now say that these AI systems might play a similar role in advancing our understanding of P vs NP?
Just my guess as a mathematician. But if LLMs are good for anything it will be for finding surprising connections and applying our existing tools in ways beyond human search. There's a huge space of tools and problems, and human intuition and brute force searching can only go so far. I can imagine that LLMs might start to find combinatorial proofs of topological theorems, maybe even novel theorems. Or vice versa. But I find it difficult to imagine them inventing new tools and objects that are really useful.
Do you have a past example of an already-proven theorem, or of new tools/objects, that would have been possible only for a human and not an AI? Any such example would make your argument much more approachable for non-mathematicians.
No one is focused on those. They're much more focused on more rote problems.
You might find them used to accelerate research math by helping mathematicians with lemmas, checking for errors, and formalizing proofs. That seems realistic in the next couple of years.
There are some AI guys like Christian Szegedy who predict that AI will be a "superhuman mathematician," solving problems like the Riemann hypothesis, by the end of 2026. I don't take it very seriously, but that kind of prognostication is definitely out there.
Why have they still not released a paper, aside from a press release? I have to admit I still don't know how auspicious it is that running Google hardware for three days apiece was able to find half-page-long solutions, given that the promise has always been to solve the Riemann hypothesis with the click of a button. But of course I do recognize that it's a big achievement relative to previous work in automatic theorem proving.
I don't know why so few people realize this, but by solving any of the problems, its performance is superhuman for most reasonable definitions of "human".
Talking about things like solving the Riemann hypothesis in so many years assumes a little too much about the difficulty of problems that we can't even begin to conceive of a solution for. A better question is what can happen when everybody has access to above average reasoning. Our society is structured around avoiding confronting people with difficult questions, except when they are intended to get the answer wrong.
Well, multiplying two large numbers instantly is a superhuman feat a calculator can do. I would hope we are going for a higher bar, like "useful". Let's see if this can provide proofs of novel results.
The ability to multiply two numbers superhumanly has already transformed society.
It's worth emphasizing that it's been possible for years to use an automatic theorem prover to prove novel results. The whole problem is to get novel interesting results.
That superhuman capability of "multiplying two large numbers instantly" has transformed human society like not even plumbing has. I really can't see how you could not consider this useful.
We know that any theorem that is provable at all (in the chosen foundation of mathematics) can be found by patiently enumerating all possible proofs. So, in order to evaluate AlphaProof's achievements, we'd need to know how much of a shortcut AlphaProof achieved. A good proxy for that would be the total energy usage for training and running AlphaProof. A moderate proxy for that would be the number of GPUs / TPUs that were run for 3 days. If it's somebody's laptop, it would be super impressive. If it's 1000s of TPUs, then less so.
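To put rough (purely illustrative) numbers on that proxy: 1,000 TPUs running for 3 days is 72,000 accelerator-hours, against 72 machine-hours for a single laptop over the same period, and each accelerator delivers orders of magnitude more FLOPs per hour than a laptop. The two scenarios plausibly differ by five or six orders of magnitude in raw compute, which is exactly the gap the proxy is meant to capture.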
> We know that any theorem that is provable at all (in the chosen foundation of mathematics) can be found by patiently enumerating all possible proofs.
Which computer science theorem is this from?
It's a direct consequence of the format of a proof. They're finite-length sequences of a finite alphabet of symbols, so of course they're enumerable (the same algorithm you use to count works to enumerate them).
If you want a computer to be able to tell that it found a correct proof once it enumerates it, you need a bit more than that (really just the existence of automated proof checkers is enough for that).
It’s just an obvious statement. If a proof exists, you will eventually get to it.
I guess it is tautological from the definition of "provable". A theorem is provable by definition if there is a finite well-formed formula that has the theorem as a consequence (https://en.wikipedia.org/wiki/Theorem, section "Theorem in logic").
Not sure it’s a tautology. It’s not obvious that a recursively enumerable procedure exists for arbitrary formal systems that will eventually reach all theorems derivable via the axioms and transformation rules. For example, if you perform depth-first traversal, you will not reach all theorems.
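A concrete sketch of an enumeration that does work (toy code, not how AlphaProof operates; the checker below is a deliberately silly placeholder for a real kernel such as Lean's): enumerate candidates in shortlex order, i.e. breadth-first over lengths, which is exactly what makes every finite proof reachable where a naive depth-first traversal is not.

```python
from itertools import count, product

# Toy alphabet standing in for the symbols of a formal proof language.
ALPHABET = "abc"

def is_valid_proof(candidate: str, goal: str) -> bool:
    # Placeholder checker: stands in for a real kernel (e.g. Lean's),
    # which mechanically decides whether `candidate` proves `goal`.
    # Here a "proof" is valid iff it literally spells the goal.
    return candidate == goal

def find_proof(goal: str) -> str:
    # Shortlex (breadth-first) enumeration: all strings of length 1,
    # then length 2, and so on. Every finite string is reached after
    # finitely many steps, so if any proof exists this terminates.
    # A depth-first traversal lacks that guarantee: it would extend
    # one prefix forever and never backtrack.
    for length in count(1):
        for symbols in product(ALPHABET, repeat=length):
            candidate = "".join(symbols)
            if is_valid_proof(candidate, goal):
                return candidate

print(find_proof("cab"))  # terminates: "cab" is reached among the length-3 strings
```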
Hilbert’s program was a (failed) attempt to determine, loosely speaking, whether there was a process or procedure that could discover all mathematical truths. Any theorem depends on the formal system you start with, but the deeper implicit question is: where do the axioms come from and can we discover all of them (answer: “unknown” and “no”)?
Imagine thinking this is a recent “theorem” of computer science rather than a well-known result in mathematics from very long ago. People thought about the nature of computation long before CS was even a thing.
> A better question is what can happen when everybody has access to above average reasoning. Our society is structured around avoiding confronting people with difficult questions, except when they are intended to get the answer wrong.
What does this have to do with a hypothetical automatic theorem prover?
Logic is pretty much absent from our culture and daily life, but that could be due to its limited supply.
Being logical in social life is pretty much completely different from being logical in a mathematical argument, especially in a formal theorem proving environment. (Just try to write any kind of cultural proposition in a formal language!)
That's the way things are now, but this regime came about when proving things took intense concentration and specialized skills that very few people had. Contrast going to look something up in a library with googling something during a conversation.
Google stopped publishing interesting AI work once OpenAI took the AI lead from them, mostly with tech that Google pioneered but never monetised, like transformers.
I imagine they are under pressure not to make this mistake again.