Chain-of-code is better than chain-of-thought because it's more grounded, more specific, and achieves a lot of useful compression. But my bet is that the proposed program-of-thought is too specific. Moving all the way from "very fuzzy specification" to "very concrete code" skips all of the space in the middle, and now there's no room to iterate without a) burning lots of tokens and b) getting bogged down in finding and fixing whatever new errors are introduced in the translated representations. IOW, when there's an error, will it be in the code itself or in the scenario that code was supposed to be representing?
I think the intuition that lots of people jumped to early about how "specs are the new code" was always correct, but at the same time it was absolutely nuts to think that specs can be represented well with natural language and bullet lists in markdown. We need chain-of-spec that leverages something semi-formal and then iterates on that representation, probably with feedback from other layers. Natural language provides constraints, and guess-and-check code generation is sort of at the implementation level, but neither is actually the specification, which is the heart of the issue. A perfect intermediate language will probably end up being something pretty familiar that leverages and/or combines existing formal methods from model-checkers, logic, games, discrete simulations, graphs, UML, etc. Why? It's just very hard to beat this stuff for compression, and this is what all the "context compaction" things are really groping towards anyway. See also the wisdom about "programming is theory building" and so on.
I think if/when something like that starts getting really useful you probably won't hear much about it, and there won't be a lot of talk about the success of hybrid-systems and LLMs+symbolics. Industry giants would have a huge vested interest in keeping the useful intermediate representation/languages a secret-sauce. Why? Well, they can pretend they are still doing something semi-magical with scale and sufficiently deep chain-of-thought and bill for extra tokens. That would tend to preserve the appearance of a big-data and big-computing moat for training and inference even if it is gradually drying up.
Perhaps something like TLA+ or PlusCal specs could be the specs in terms of 'specs are the new code'.
Delusional vibe coding bullshit. Find me one significant software project based on using natural language for the software.
DSPy implemented program-of-thought a long time ago, and it works great for solving user queries with code.
What's great is that you can define a DSPy signature of the type “question, data -> answer” where “data” is a pandas DataFrame; DSPy then prompts the LLM to answer the question using the data and Python code. Extremely powerful.
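For anyone curious, here's roughly what that looks like (a minimal sketch, assuming DSPy's built-in ProgramOfThought module; the model name and the toy DataFrame are placeholders):

```python
import dspy
import pandas as pd

# Placeholder model; point this at whatever LM you actually use.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A "question, data -> answer" signature: DSPy prompts the LLM to write
# Python that uses `data` to compute `answer`, runs it, and returns the result.
pot = dspy.ProgramOfThought("question, data -> answer")

df = pd.DataFrame({"city": ["Paris", "Lyon"], "population": [2_100_000, 500_000]})
result = pot(question="Which city has the larger population?", data=df)
print(result.answer)
```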
This seems to be incorporated into current LLM generations already -- when code execution is enabled, both GPT-5.x and Claude 4.x automatically seem to execute Python code to help with reasoning steps.
This was integrated into GPT-4 two years ago:
https://www.reddit.com/r/ChatGPT/comments/14sqcg8/anyone_els...
I remember seeing that GPT-5 had two Python tools defined in its leaked prompt; one of them would hide the output from the user-visible chain-of-thought UI.
Same with CoT prompting.
If you compare the outputs of a CoT input vs a control input, the outputs will have the reasoning step either way for the current generation of models.
Yeah, this is honestly one of the coolest developments of new models.
And even before this work, there was "PAL: Program-aided Language Models" (https://arxiv.org/abs/2211.10435, https://reasonwithpal.com/).
Afaik PaLM (Google's OG big models) tried this trick, but it didn't work for them. I think the difference is that PaL used descriptive inline comments + meaningful variable names. Compare the following:
```python
# calculate the remaining apples
apples_left = apples_bought - apples_eaten
```
vs.
```python
x = y - z
```
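For context, a complete PAL-style solution reads roughly like this (my own illustration of the style, not a verbatim prompt from the paper): the model writes a small program with descriptive names and comments, and the Python interpreter does the arithmetic.

```python
# Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
#    Each can has 3 tennis balls. How many tennis balls does he have now?
tennis_balls = 5
bought_cans = 2
balls_per_can = 3
total_tennis_balls = tennis_balls + bought_cans * balls_per_can
print(total_tennis_balls)  # 11
```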
We have ablations in https://arxiv.org/abs/2211.10435 showing that both are indeed useful (see "Crafting prompts for PAL").
The underlying paper is from 2022, which should be indicated in the title.
Worth noting: this paper was published three days before the release of GPT-3.5
Steve Krouse had an amazing rant two weeks back on/against MCP, about how asking AI to write code that calls MCP servers has eaten away at actually calling the tools directly. It feels similar: code being the more grounded system. https://x.com/stevekrouse/status/1988641250329989533
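The pattern being argued for looks roughly like this (my sketch; the tool functions are made-up stand-ins for MCP/tool wrappers exposed inside a sandbox): instead of emitting one tool call per step and round-tripping every intermediate result through the context window, the model writes one program that composes the tools and surfaces only the final answer.

```python
# Hypothetical tool wrappers (stubbed here); in a real setup these would call
# MCP servers or external APIs from inside a sandbox.
def search_orders(customer_id: str) -> list[dict]:
    return [{"id": "ord_1"}, {"id": "ord_2"}]

def get_invoice(order_id: str) -> dict:
    return {"order_id": order_id, "amount_due": 40.0, "paid": order_id == "ord_1"}

# One generated program does the looping and filtering itself, rather than
# pasting every order and invoice back into the model's context.
def total_outstanding(customer_id: str) -> float:
    orders = search_orders(customer_id)
    invoices = [get_invoice(o["id"]) for o in orders]
    return sum(inv["amount_due"] for inv in invoices if not inv["paid"])

print(total_outstanding("cust_42"))  # 40.0
```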
Anthropic recently added this to the API: https://www.anthropic.com/engineering/advanced-tool-use
See "Programmatic Tool Calling"
And there was an AI productivity startup called Lutra AI doing this, although they've since pivoted to some kind of MCP infra thing: https://lutra.ai/
> https://www.reddit.com/r/ChatGPT/comments/14sqcg8/anyone_els...
That is different; it's a code interpreter where the model can run code and see the outputs. It's not a different way of doing tool calls.
What is "program-of-thought" ?
> What is "program-of-thought" ?
"One of the hardest things in programming?" for $1000, Alex.
chain of shit. learn Prolog, bois.
“Statistical matrix math outperforms statistical matrix math!” More at 11
I call that self-destructive prompting, in the sense that you use AI to output programs that replace calling the AI in the future. The paper seems to indicate that this also brings much better results. However, it's subject to attacks, since running generated code is usually unsafe. A sandbox has to be used; major agentic AI players are providing solutions, like the LangChain sandbox released earlier this year.
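At a minimum, that means running the generated code out-of-process with a timeout. Here's a rough sketch (not a real security boundary, just damage limitation; proper isolation needs containers/VMs or a hosted sandbox service):

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: int = 5) -> str:
    """Execute LLM-generated Python in a separate process with a timeout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # -I runs Python in isolated mode (ignores env vars and user site-packages).
    proc = subprocess.run(
        [sys.executable, "-I", path],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return proc.stdout if proc.returncode == 0 else proc.stderr

print(run_generated_code("print(2 + 2)"))  # -> 4
```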