Love the p_step^N framing — maps cleanly to agentic chains where
errors compound. Worth naming a second reliability failure that
sits one layer below, in single-turn: confident-absent.
Ran "what's the best X for Y?" across 6 LLMs (ChatGPT, Perplexity,
Gemini, Claude, DeepSeek, Mistral) for ~200 B2B SaaS tools across
34 categories. In 60%+ of categories, models converge on the same
"default three" and everything else is effectively invisible. Not
wrong — just erased. Single-turn, so it never shows up in p_step^N.
A verification layer catches "false." But there's no layer catching
"the space of correct answers was silently pruned." Curious if your
framework could be extended from correctness per step to coverage
per response.
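To make "coverage per response" concrete, here is a minimal sketch of a metric over the kind of data described above. Everything here is hypothetical (the tool names, the list-of-lists input shape, the top-3 concentration measure); it just shows one way to quantify "default three" convergence alongside coverage.

```python
from collections import Counter

def coverage_stats(responses, category_tools):
    """Fraction of a category's known tools that any model ever names,
    plus how concentrated mentions are on a few defaults.

    responses: list of lists, one list of recommended tool names per model run.
    category_tools: the full set of valid tools in the category.
    """
    mentions = Counter(t for r in responses for t in r if t in category_tools)
    covered = len(mentions) / len(category_tools)          # coverage per response set
    top3 = sum(c for _, c in mentions.most_common(3))
    concentration = top3 / max(sum(mentions.values()), 1)  # share going to the "default three"
    return covered, concentration

# Hypothetical category with 10 valid tools; 4 model runs name mostly the same 3.
tools = {f"tool{i}" for i in range(10)}
runs = [["tool0", "tool1", "tool2"],
        ["tool0", "tool1", "tool2"],
        ["tool0", "tool2", "tool3"],
        ["tool1", "tool2", "tool0"]]
cov, conc = coverage_stats(runs, tools)
print(cov, conc)  # 0.4 coverage; ~0.92 of all mentions go to the top three
```

A verification layer scores each returned name; this instead scores what never got returned at all.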
The framing of p_step^N is useful, but it points to a deeper architectural problem: verification fails because it samples from the same distribution as the generator. The real fix isn't better prompting — it's independent verification with uncorrelated error distributions.
This maps directly to institutional governance problems. A decision made by a single agent with no memory of prior decisions, no reputation weight, and no contextual history of outcomes will fail the same way — not randomly, but systematically, in the same direction.
Persistent memory reduces N by eliminating context reconstruction at each session. Reputation-weighted voting creates genuinely independent verification — an agent with a strong track record samples from a different distribution than a new one. And outcome contextualization feeds results back into the next cycle rather than discarding them.
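A rough sketch of what reputation-weighted verification plus outcome feedback could look like, under my own assumptions (the agent names, the 0.5 threshold, and the exponential-moving-average update are all illustrative, not anything from the post):

```python
def weighted_verdict(votes, reputation):
    """Aggregate verifier votes, weighting each agent by its track record.

    votes: dict agent -> bool (True = "output is correct")
    reputation: dict agent -> historical accuracy in [0, 1]
    """
    yes = sum(reputation[a] for a, v in votes.items() if v)
    total = sum(reputation[a] for a in votes)
    return yes / total > 0.5

def update_reputation(reputation, agent, was_right, lr=0.1):
    """Outcome contextualization: feed the result back into the next cycle."""
    reputation[agent] += lr * ((1.0 if was_right else 0.0) - reputation[agent])
    return reputation

rep = {"veteran": 0.9, "newcomer": 0.5, "flaky": 0.3}
votes = {"veteran": True, "newcomer": False, "flaky": False}
print(weighted_verdict(votes, rep))  # True: 0.9 / 1.7 ≈ 0.53 > 0.5
```

The point of the sketch: the veteran's vote carries the day against two low-reputation dissenters, which is only safe if reputations are actually earned via `update_reputation` over time.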
The author identifies the problem precisely. The solution isn't a better prompt — it's a different architecture.
Curious what you're measuring as 'reliability' here: is it consistency across runs, accuracy vs. confidence, or something else? The visualization angle is cool; I'd like to see how you handle the variance across different model sizes.
There are some problems reasoning about it. In the thinking output, the steps aren't really discrete, so I can't apply series-system reliability formulas directly, but they are a very good metaphor. I can see in the thinking output [1 + 1 = 2], [10 + 1 = 11], [20 + 1 = 21], ... [N]. The metaphor: the probability of each bracket being correct is the probability of the agent solving that one equation. If each bracket is right 95% of the time, a 10-step chain finishes correctly 0.95^10 ≈ 60% of the time.
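The series-system arithmetic is easy to check directly:

```python
def chain_reliability(p_step, n_steps):
    """Series-system model: the chain succeeds only if every step does."""
    return p_step ** n_steps

print(round(chain_reliability(0.95, 10), 3))   # 0.599, i.e. ~60%
print(round(chain_reliability(0.99, 10), 3))   # 0.904
print(round(chain_reliability(0.95, 50), 3))   # 0.077
```

Note how unforgiving the exponent is: at 50 steps, 95%-per-step reliability leaves under an 8% chance of a correct final answer.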
So I started a climb: 5 digits × 5 digits, 6 digits × 6 digits, ..., N digits × N digits. There is a clean decrease in the reliability with which the agent gets the correct answer, ending in a cliff beyond which it always fails.
| Model  | Last pass | First fail |
|--------|-----------|------------|
| Haiku  | 10 digits | 12 digits  |
| Sonnet | 30 digits | 33 digits  |
| Opus   | 50 digits | 52 digits  |
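A sketch of the climb protocol described above, assuming a model call is available. `ask_model` is a placeholder for that call (not a real API), and the trial count and stopping rule are my own choices:

```python
import random

def climb(ask_model, start=5, stop=60, trials=20):
    """Probe N-digit x N-digit multiplication until the model always fails.

    ask_model(prompt) -> str is a placeholder for the actual model call.
    Returns pass rate per digit count, so the reliability cliff is visible.
    """
    rates = {}
    for n in range(start, stop + 1):
        passed = 0
        for _ in range(trials):
            a = random.randint(10**(n - 1), 10**n - 1)
            b = random.randint(10**(n - 1), 10**n - 1)
            answer = ask_model(f"Compute {a} * {b}. Reply with digits only.")
            passed += answer.strip() == str(a * b)
        rates[n] = passed / trials
        if rates[n] == 0:          # first size with zero passes: the cliff
            break
    return rates
```

Plotting `rates` against `n` should reproduce the clean decrease-then-cliff shape for each model tier.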
I have an agent that reverse-engineers any website, producing an API optimized to use the fewest resources needed to interact with that site. The agent also rewrites itself every iteration; it's a recursive agent.
The agent updates its own instructions and runs an evaluation (I hate that they stole the word "harness," because it is a test harness) against 5 extremely difficult-to-reverse-engineer websites like Ticketmaster, YouTube, Twitch, etc.
Each evaluation tracks whether it passes (finding all the endpoints, including streaming, GraphQL, and WebSockets), the number of tokens used, and the elapsed time. Nothing about it is deterministic, but it IS probabilistic: with the same prompt, both the chance of passing and the number of tokens consumed form a distribution.
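That per-run bookkeeping can be sketched as a small tracker; the class name, field layout, and summary statistics here are my own invention, just illustrating "same prompt, distribution of outcomes":

```python
import statistics

class EvalTracker:
    """Track pass/fail, token cost, and wall time per evaluation run.

    Because the agent is probabilistic, the same prompt yields a
    distribution of outcomes; summarize the distribution, not one run.
    """
    def __init__(self):
        self.runs = []  # (passed, tokens, seconds)

    def record(self, passed, tokens, seconds):
        self.runs.append((passed, tokens, seconds))

    def summary(self):
        passes = [r[0] for r in self.runs]
        tokens = [r[1] for r in self.runs]
        return {
            "pass_rate": sum(passes) / len(passes),
            "tokens_mean": statistics.mean(tokens),
            "tokens_stdev": statistics.stdev(tokens) if len(tokens) > 1 else 0.0,
        }

t = EvalTracker()
for run in [(True, 12000, 95), (True, 9800, 80), (False, 21000, 140)]:
    t.record(*run)
print(t.summary())  # pass_rate ~0.67, tokens_mean ~14266.7
```

With enough runs per target site, `pass_rate` becomes an empirical p_step for that site, which plugs straight back into the p_step^N framing.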
I'm trying very hard to build a mental model of how this all works.