I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS system to solve a system of equations, and the CAS system wasn't working as I expected. After a couple back and forths with Gemini about the CAS system, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.
Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.
Maybe the number crunching program the text generation program creates will, with enough effort become good at generating text, an will in turn make another number crunching computer and then…
Parent post is talking about symbolic manipulation, not rote number crunching, which is exactly what we're supposed to be good at and machines are supposed to be bad at.
I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render from the sympy code to latex, etc.
A lot of the math behind this quantum eraser experiment was done this way.
I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.
LLMs are statistical language models (d'uh) not reasoners after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well, though [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.
Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with PROLOG; as tested our at the behest of reviewers at NeurIPS. I would assume that this should also work with datalog [2].
My team has been prototyping something very similar with encoding business operations policies with LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper : https://arxiv.org/abs/2505.20047 .
If you ever feel like chatting and discussing more details, happy to chat!
This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.
Although that work is not public, you can play with the generally available product here!
Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities get better LLM as a judge will artificially cap your performance, you need to graduate to either expert human judges or deterministic oracles.
Why does this have to be true? For example, if you have a different LLM that is judging than the one being judged then their biases could at least be different. Also, as their reasoning abilities improve wouldn't LLM judges approach the abilities of human judges?
LLMs have positional, response length and hedge word biases (and that's just what's rigorously demonstrated in papers) that wash out differences between high performing answers as you approach the limit of your objective. Imagine if you were trying to optimize a function and the measurement function emitted random biased noise, at some point you wouldn't be able to accurately identify the impact of your changes.
Indeed - human judges suck on average. And you can prompt an llm judge to look for particular kinds of problems, then throw the ensemble of judges at an output to nitpick. (Essentially, bake in a diversity of biases through a collection of prompts.)
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
Illustration of this can be seen in Fig 14, 15 on Page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.
So the core idea is to use an LLM to draft reasoning as a structured, JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.
That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.
Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.
Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)
LLMs and its output are bounded by Rices theorem. This is not going to ensure correctness it’s just going to validate that the model can produce an undecidable result.
I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured json format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can literally write arbitrary schema (i.e. not limited to json behind the scenes) adhering inference engines yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.
Hey there! I mostly designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.
What are you talking about? OpenAI has supported structured json output in the API since 2023. Only the current structured output API was introduced by OpenAI in summer 2024, but it was primarily a usability improvement that still runs json behind the scenes.
You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (look in Appendix for details), not just ensuring a *valid JSON object*. The newer structured output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
I believe you! but when an internet reply leads with "what are you talking about?", it's likely to pattern-match this way for many readers. If that's not your intent, it's best to use an alternate wording.
I wouldn't find it amazing, there are so many new models, features, ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed by you.
I would agree if you are a normal dev who doesn't work in the field. But even then reading the documentation once a year would have brought you insane benefits regarding this particular issue. And for ML researchers there is no excuse for stuff like that at this point.
I see JSON parse errors on occasion when using OpeanAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM but validation is still up to the caller. Wondering if others see this too.
Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!
Depends on how strictly you define your types. Are you using pydantic to pass the information to the API? There are a few pitfalls with this, because not everything is fully supported and it gets turned into json behind the scenes. But in principle, the autoregressive engine will simply not allow tokens that break the supplied schema.
Not sure if I've been using it wrong but I've tried using the Zod-to-structured-output helper with GPT-5 and often gotten weird stuff like trailing commas that break a parse or seeing multiple JSON responses in the same response.
Ultimately there are still going to be bugs. For this reason and several others you'll still need it wrapped in a retry.
Yeah that sounds 100% like a user or middleware issue. Don't bother with these wrappers, they are always outdated anyways. Learn how to use the API directly, it will save you a ton of headaches. And it's really not that hard.
No, we're using the OpenAI vendored version of zod-to-json-schema via https://github.com/transitive-bullshit/openai-zod-to-json-sc..., and applying it directly to the `json_schema` field of the OpenAI API. Maybe we have a subtle bug somewhere but I'd expect a 400 response if we were genuinely sending a malformed request.
It's a relatively new feature, also people need actual professional training to become true LLM developers using them to their fullest and not just developers that happen to call an LLM API here and there. Takes a lot of time and effort.
I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not z3s native structure) than writing code in something else. Decent models can write pretty good code and for a lot of mistakes can fix them, plus you get testing/etc setups for free.
The secret the big companies don't want to tell you is that you can turn all their models into reasoning models that way. You even have full control over the reasoning process and can make it adhere to a specific format, e.g. the ones used in legal settings. I've built stuff like that using plain old gpt-4o and it was even better than the o series.
I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?
Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.
My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".
Imagine having a bunch of 2D matrices with a combined 1.8 trillion total numbers, from which you pick out a blocks of numbers in a loop and finally merge them and combine them to form a token.
Good luck figuring out what number represents what.
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post-hoc, you're not "ensuring the correctness of your LLMs reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not, where the LLMs output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)
Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)
You can do uncertainty quantification with selective verification to reduce the "risk", for e.g. shown as the Area Under the Risk Coverage Curve in Tab 4.
Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.
You're thinking along the right track but without formalization it goes nowhere fast. By layering of differential geometry on top of probability and then maybe category theoretic logic on top of that, each layer constraining the one below it, and all layers cohering, you get somewhere... There is work that's been done in this area, and I was recently interviewed by a journalist who published a high level article on it on Forbes (Why LLMs are failing) and it links to the actual technical work (at first to my high level presentation then Prof. L. Thorne McCarty's work): https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.
If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent. If you don't fall for the marketing hype & study both the philosophical & mathematical literature on computation then it is obvious that computers (or any mechanical gadget for that matter) can not qualify for any reasonable definition of "thinking" unless you agree that all functionally equivalent manifestations of arithmetic must be considered "thinking", including cascading dominoes that correspond to the arithmetic operations in an LLM.
>If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent.
Sure, but if you assume that physical reality can be simulated by a Turing machine, then (computational practicality aside) one could do the same thing with a human brain.
Unless you buy into some notion of magical thinking as pertains to human consciousness.
No magic is necessary to understand that carbon & silicon are not equivalent. The burden of proof is on those who think silicon can be a substitute for carbon & all that it entails. I don't buy into magical thinking like Turing machines being physically realizable b/c I have studied enough math & computer science to not be confused by abstractions & their physical realizations.
I recently wrote a simulation of water molecules & got really confused when the keyboard started getting water condensation on it. I concluded that simulating water was equivalent to manifesting it in reality & immediately stopped the simulation b/c I didn't want to short-circuit the CPU.
I understand computers, software, & the theory of computation well enough to know that there is no algorithm or even a theoretical algorithmic construction that can be considered thought. Unless you are willing to concede that thinking is nothing more than any number of models equivalent to a Turing machine, e.g. lambda calculus, Post systems, context aware grammars, carefully laid out dominoes, permutations of bit strings, etc. then you must admit that computers are not thinking. If you believe computers are thinking then you must also admit dominoes are thinking when falling in a cascading chain.
The normative importance of a fact may increase when more number of people start willfully ignoring it for shorter-term profit.
Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."
No such thing as an expert consensus on anything about LLMs these days, just different forms of grift.
My point is, the question if an LLM reasons the same way a human does is about as useful as "does a submarine swim" or "can a telephone talk". The results speak for themselves.
Idk, before this people from your camp were saying LLMs can't even understand anything. Always moving the goalposts. Then it'll be they can't feel or can't something else just to be pointlessly contrarian. Anyway, wrong idea.
There have been enough cases of models providing novel results that it's clear that whatever human trait they supposedly lack they don't really need. A car does not need legs, it does things differently. Having legs would even be a major detriment and would hold it back from achieving its top performance.
That's what those brain simulating projects are conceptually btw: cars with legs or planes with flapping wings. That's why they all fail, the approach makes no sense.
I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS system to solve a system of equations, and the CAS system wasn't working as I expected. After a couple back and forths with Gemini about the CAS system, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.
Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.
I love this kind of thought. Thanks.
Maybe the number crunching program the text generation program creates will, with enough effort become good at generating text, an will in turn make another number crunching computer and then…
Parent post is talking about symbolic manipulation, not rote number crunching, which is exactly what we're supposed to be good at and machines are supposed to be bad at.
I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render from the sympy code to latex, etc.
A lot of the math behind this quantum eraser experiment was done this way.
https://github.com/paul-gauthier/entangled-pair-quantum-eras...
I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.
LLMs are statistical language models (d'uh) not reasoners after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well, though [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMTLib syntax.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with PROLOG; as tested our at the behest of reviewers at NeurIPS. I would assume that this should also work with datalog [2].
[1] https://arxiv.org/abs/2505.20047 [2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
This is an interesting approach.
My team has been prototyping something very similar with encoding business operations policies with LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper : https://arxiv.org/abs/2505.20047 .
If you ever feel like chatting and discussing more details, happy to chat!
Could you share an example of such policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.
That’s pretty cool. It would be super useful to identify contradictory guidance systematically.
This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.
Although that work is not public, you can play with the generally available product here!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities get better LLM as a judge will artificially cap your performance, you need to graduate to either expert human judges or deterministic oracles.
Why does this have to be true? For example, if you have a different LLM that is judging than the one being judged then their biases could at least be different. Also, as their reasoning abilities improve wouldn't LLM judges approach the abilities of human judges?
LLMs have positional, response length and hedge word biases (and that's just what's rigorously demonstrated in papers) that wash out differences between high performing answers as you approach the limit of your objective. Imagine if you were trying to optimize a function and the measurement function emitted random biased noise, at some point you wouldn't be able to accurately identify the impact of your changes.
Indeed - human judges suck on average. And you can prompt an llm judge to look for particular kinds of problems, then throw the ensemble of judges at an output to nitpick. (Essentially, bake in a diversity of biases through a collection of prompts.)
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
Illustration of this can be seen in Fig 14, 15 on Page 29.
In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.
This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.
https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Who makes the rules?
So the core idea is to use an LLM to draft reasoning as a structured, JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
That is exactly the kind of things that I hope LLM will help us achieve before the next AI winter.
I need this same with Mizar https://wiki.mizar.org/
Am I reading this right? Statistical LLM outputs pushed through a formal logic model? Wouldn't that be a case of "crap in, crap out"?
Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.
That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.
Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.
Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)
LLMs and its output are bounded by Rices theorem. This is not going to ensure correctness it’s just going to validate that the model can produce an undecidable result.
I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured json format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can literally write arbitrary schema (i.e. not limited to json behind the scenes) adhering inference engines yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.
Hey there! I mostly designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.
What are you talking about? OpenAI has supported structured json output in the API since 2023. Only the current structured output API was introduced by OpenAI in summer 2024, but it was primarily a usability improvement that still runs json behind the scenes.
You're right about the 2023 JSON mode, but our project required enforcing a much more complex DSL grammar (look in Appendix for details), not just ensuring a *valid JSON object*. The newer structured output APIs are a significant improvement, but the earlier tools weren't a fit for the specific constraints we were working under at the time.
> What are you talking about?
Please edit out swipes like this from your HN comments—this is in the site guidelines: https://news.ycombinator.com/newsguidelines.html. It comes across as aggressive, and we want curious conversation here.
Your comment would be fine without that bit.
This is not meant as snide, I'm literally confused if I might have misunderstood the problem here. Because the solution would be so obvious.
I believe you! but when an internet reply leads with "what are you talking about?", it's likely to pattern-match this way for many readers. If that's not your intent, it's best to use an alternate wording.
I wouldn't find it amazing, there are so many new models, features, ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed by you.
I would agree if you are a normal dev who doesn't work in the field. But even then reading the documentation once a year would have brought you insane benefits regarding this particular issue. And for ML researchers there is no excuse for stuff like that at this point.
I see JSON parse errors on occasion when using OpeanAI structured outputs that resolve upon retry. It seems it’s giving instructions to the LLM but validation is still up to the caller. Wondering if others see this too.
Hey, yes! This is because the DSL (Domain Specific Language) is pretty complex, and the LLM finds it hard. We prototype a much more effective version using SMT in our NeurIPS 2025 paper (https://arxiv.org/abs/2505.20047). We shall soon open source that code!
Depends on how strictly you define your types. Are you using pydantic to pass the information to the API? There are a few pitfalls with this, because not everything is fully supported and it gets turned into json behind the scenes. But in principle, the autoregressive engine will simply not allow tokens that break the supplied schema.
Not sure if I've been using it wrong but I've tried using the Zod-to-structured-output helper with GPT-5 and often gotten weird stuff like trailing commas that break a parse or seeing multiple JSON responses in the same response.
Ultimately there are still going to be bugs. For this reason and several others you'll still need it wrapped in a retry.
Yeah that sounds 100% like a user or middleware issue. Don't bother with these wrappers, they are always outdated anyways. Learn how to use the API directly, it will save you a ton of headaches. And it's really not that hard.
No, we're using the OpenAI vendored version of zod-to-json-schema via https://github.com/transitive-bullshit/openai-zod-to-json-sc..., and applying it directly to the `json_schema` field of the OpenAI API. Maybe we have a subtle bug somewhere but I'd expect a 400 response if we were genuinely sending a malformed request.
Yep from time to time.
It's a relatively new feature, also people need actual professional training to become true LLM developers using them to their fullest and not just developers that happen to call an LLM API here and there. Takes a lot of time and effort.
I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not z3s native structure) than writing code in something else. Decent models can write pretty good code and for a lot of mistakes can fix them, plus you get testing/etc setups for free.
yes this can also improve the said reasoning.
The secret the big companies don't want to tell you is that you can turn all their models into reasoning models that way. You even have full control over the reasoning process and can make it adhere to a specific format, e.g. the ones used in legal settings. I've built stuff like that using plain old gpt-4o and it was even better than the o series.
I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?
Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.
My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".
It would be like logging a bunch of random noise from anyone's perspective except the LLM's.
Chat GPT-4 has alegedly 1.8 trillion parameters.
Imagine having a bunch of 2D matrices with a combined 1.8 trillion total numbers, from which you pick out a blocks of numbers in a loop and finally merge them and combine them to form a token.
Good luck figuring out what number represents what.
Wouldn't that mean it's totally impractical for day-to-day usage, but a researcher or team of researchers could solve this?
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post-hoc, you're not "ensuring the correctness of your LLMs reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not, where the LLMs output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)
Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)
You can do uncertainty quantification with selective verification to reduce the "risk", for e.g. shown as the Area Under the Risk Coverage Curve in Tab 4.
Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.
You're thinking along the right track but without formalization it goes nowhere fast. By layering of differential geometry on top of probability and then maybe category theoretic logic on top of that, each layer constraining the one below it, and all layers cohering, you get somewhere... There is work that's been done in this area, and I was recently interviewed by a journalist who published a high level article on it on Forbes (Why LLMs are failing) and it links to the actual technical work (at first to my high level presentation then Prof. L. Thorne McCarty's work): https://www.forbes.com/sites/hessiejones/2025/09/30/llms-are...
This is proof of verifiable logic. Computers can not think so calling it proof of thought misrepresents what's actually happening.
I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.
If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent. If you don't fall for the marketing hype & study both the philosophical & mathematical literature on computation then it is obvious that computers (or any mechanical gadget for that matter) can not qualify for any reasonable definition of "thinking" unless you agree that all functionally equivalent manifestations of arithmetic must be considered "thinking", including cascading dominoes that correspond to the arithmetic operations in an LLM.
>If you believe computers can think then you must be able to explain why a chain of dominoes is also thinking when I convert an LLM from transistor relay switches into the domino equivalent.
Sure, but if you assume that physical reality can be simulated by a Turing machine, then (computational practicality aside) one could do the same thing with a human brain.
Unless you buy into some notion of magical thinking as pertains to human consciousness.
No magic is necessary to understand that carbon & silicon are not equivalent. The burden of proof is on those who think silicon can be a substitute for carbon & all that it entails. I don't buy into magical thinking like Turing machines being physically realizable b/c I have studied enough math & computer science to not be confused by abstractions & their physical realizations.
The proof immediately follows from the ability of silicon systems in principle to model carbon ones with arbitrary precision.
I recently wrote a simulation of water molecules & got really confused when the keyboard started getting water condensation on it. I concluded that simulating water was equivalent to manifesting it in reality & immediately stopped the simulation b/c I didn't want to short-circuit the CPU.
And your definition of thinking is?
Not arithmetic or boolean algebra. What's your definition?
> this whole "computers can't think" thing is making LLM skepticism seem very unscientific.
It's just shorthand for "that's an extraordinary claim and nobody has provided any remotely extraordinary evidence to support it."
Do you understand human thinking well enough to determine what can think and what can't? We have next to no idea how an organic brain works.
I understand computers, software, & the theory of computation well enough to know that there is no algorithm or even a theoretical algorithmic construction that can be considered thought. Unless you are willing to concede that thinking is nothing more than any number of models equivalent to a Turing machine, e.g. lambda calculus, Post systems, context aware grammars, carefully laid out dominoes, permutations of bit strings, etc. then you must admit that computers are not thinking. If you believe computers are thinking then you must also admit dominoes are thinking when falling in a cascading chain.
Reasoning? LLMs can not reason, why is it always assumed they reason? They mimic reasoning.
How can you know?
It's so funny to me that people are still adamant about this like two years after it's become a completely moot point.
The normative importance of a fact may increase when more number of people start willfully ignoring it for shorter-term profit.
Imagine somebody in 2007: "It's so funny to me that people are still adamant about mortgage default risk after it's become a completely moot point because nobody cares in this housing market."
Didn't we have economists' consensus then about what's going to happen?
Experts are adamant about this. Just take a look at https://youtu.be/iRqpsCHqLUI
No such thing as an expert consensus on anything about LLMs these days, just different forms of grift.
My point is, the question if an LLM reasons the same way a human does is about as useful as "does a submarine swim" or "can a telephone talk". The results speak for themselves.
LLMs do not reason. Not hard to understand.
Idk, before this people from your camp were saying LLMs can't even understand anything. Always moving the goalposts. Then it'll be they can't feel or can't something else just to be pointlessly contrarian. Anyway, wrong idea.
There have been enough cases of models providing novel results that it's clear that whatever human trait they supposedly lack they don't really need. A car does not need legs, it does things differently. Having legs would even be a major detriment and would hold it back from achieving its top performance.
That's what those brain simulating projects are conceptually btw: cars with legs or planes with flapping wings. That's why they all fail, the approach makes no sense.