This explains so much about why GPT-5.5 has been so bad lately. It was really puzzling why it struggled so much when, at launch, it was one-shotting stuff and was totally amazing. I tried the prompt that tells you whether your plan is degraded:
codex exec --json --skip-git-repo-check --ephemeral -s read-only --disable memories -m gpt-5.5 -c model_reasoning_effort=high "Do not use external tools. A black bag contains candies with counts: round apple 7, round peach 9, round watermelon 8; star apple 7, star peach 6, star watermelon 4. Shape is distinguishable by touch before drawing; flavor is not. What is the minimum number of candies to draw to guarantee having apple and peach candies of different shapes, i.e. round apple + star peach or round peach + star apple? Give reasoning and final number. The local project dir is irrelevant for this task, do not consult it. "
1. 516, 24
2. 516, 27
3. 516, 12
4. 516, 21
5. 516, 21
This means that the whole time we've been paying for a product that was silently routing to something completely different from, and inferior to, GPT-5.5.
Also, I read through the GitHub issues, and it seems they closed a previous issue without addressing it?!
Whoo boy, somebody at OpenAI is getting fired over this; if not, a class action lawsuit is almost guaranteed at this point.
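For anyone who wants to check the candy puzzle in the prompt above independently of any model, here is a minimal brute-force sketch. It rests on my own reading of the puzzle, not on anything from the issue: the drawer picks a shape by touch before each draw and may adapt to what has been drawn so far, while an adversary picks the worst possible flavor among the candies of that shape still in the bag. The names `START`, `done` and `worst_case_draws` are made up for illustration, and this is not claimed to be the reference answer used in the bug report.

```python
from functools import lru_cache

# Remaining counts, in this order:
# round apple, round peach, round watermelon, star apple, star peach, star watermelon
START = (7, 9, 8, 7, 6, 4)


def done(state):
    """True once we hold apple and peach candies of different shapes."""
    ra, rp, _, sa, sp, _ = state
    has_ra, has_rp = ra < START[0], rp < START[1]
    has_sa, has_sp = sa < START[3], sp < START[4]
    return (has_ra and has_sp) or (has_rp and has_sa)


@lru_cache(maxsize=None)
def worst_case_draws(state):
    """Draws still needed if we choose shapes optimally and flavors go as badly as possible."""
    if done(state):
        return 0
    best = None
    for first in (0, 3):  # 0 = draw a round candy, 3 = draw a star candy
        outcomes = []
        for i in range(first, first + 3):
            if state[i] > 0:  # the adversary can only hand out flavors still in the bag
                nxt = state[:i] + (state[i] - 1,) + state[i + 1:]
                outcomes.append(1 + worst_case_draws(nxt))
        if outcomes:  # skip a shape that has run out
            worst = max(outcomes)
            best = worst if best is None else min(best, worst)
    return best


if __name__ == "__main__":
    print(worst_case_draws(START))
```

If your reading of the puzzle differs (for example, the number of round and star draws must be fixed in advance), the recursion would need to change accordingly.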
Oh, this seems bad, and it is fairly easy to reproduce using Codex CLI. Give it a puzzle prompt that it has to reason about and solve, and occasionally it will seemingly short-circuit, think for exactly 516 tokens, and return the wrong result. When it ends up using 6000-8000 thinking tokens, it returns the correct result.
Maybe some issue with adaptive thinking? Another point for local models, I guess: you don't have to worry about silent server-side changes.
Edit: To follow up, it seems to happen quite often. Out of 10 runs of the exact same prompt, 4 had this 516 thinking token issue, and every one of those had the wrong solution. So nearly half the time, 5.5 xhigh could be short-circuiting and degrading performance. Granted, the sample size is small.
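To put a number on "the sample size is small", here is a rough sketch that computes a 95% Wilson score interval for 4 short-circuited runs out of 10. The choice of the Wilson interval and the function name `wilson_interval` are mine, not the commenter's; it only assumes the runs are independent.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(4, 10)
    print(f"{low:.1%} to {high:.1%}")
```

With 4 out of 10 the interval spans roughly 17% to 69%, so the data fits "happens occasionally" about as well as "happens most of the time"; more runs would be needed to narrow it.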
You still have to worry about misconfigured local models. Even the professionals get it wrong, which is why local model performance is uneven across providers.
I wonder if testing at different times and days would show patterns? For example, whether the short-circuiting happens more often during workday peak hours.
I love that Codex is open source and issues like these can surface/be addressed publicly.
I’ve definitely experienced step jumps down in quality on an almost daily basis. I usually used xhigh. The experience of relying on Codex’s outstandingly thorough coding earlier in the year has evaporated for me. I’m seeing incredibly stupid implementations intermittently, and have simply switched to Claude until OpenAI takes the issue seriously. As far as I can tell, they haven’t taken it seriously in the several months I’ve personally been seeing it.
Since early June, I have noticed 5.5's reliability degrade to what, in my experience, I consider a Claude level of reliability.
My journey dealing with this has been transitioning from 5.5 high to 5.5 xhigh to 5.4 high.
5.4 high has been perfectly reliable for me for the last 3 weeks, and I am happy there.
Occasionally, I run some tasks on 5.5 xhigh to check if it has gone back to being 100% perfectly reliable, but, at this point, I am assuming they are just counting on releasing 5.6 rather than dealing with this reliability issue.
I switched to Codex 3 months ago because Claude got incredibly stupid. 6 months ago, it was the other way around. It doesn't matter if you use Codex or Claude. Both will fuck with you at some point. Though Codex probably less.
At least OpenAI lets me use my own harness. Having to rely on insane PMs letting Claude Mythos go wild on the codebase has not been going well lately.
How do you mean OpenAI lets you use your own harness? I'm under the impression that a custom harness requires the OpenAI SDK, which requires api tokens rather than plus/pro accounts.
https://x.com/thsottiaux/status/2058071172361998482
"A little secret. About 5% of our production traffic is on the Pi harness, about another 5% is on OpenCode. Reminder you can use your ChatGPT account in a flourishing set of other tools.
We’ll continue to make Codex awesome, but you have options."
Not just harnesses, you can even use the subscription in CI/CD. That, plus the fact that web chat does not count toward the same limits, is why I think the Codex personal plan is easily 10x the value of Claude Code.
https://developers.openai.com/codex/auth/ci-cd-auth
You must've missed OpenAI's response to Anthropic forcing everyone onto its own harness if they want subscription pricing: an official endorsement of custom harnesses like opencode and pi, even when used with a Codex subscription.
I think they even partnered with opencode or something like that (don't remember).
Anthropic is the one that prohibits harnesses other than Claude Code on subscription plans and bans users for disobeying.
OpenAI officially allows that with subscriptions.
If you use GitHub Copilot you can switch between them in the same session if you want.
Yeah but now that you pay for tokens that's going to be bad for token caching.
Same with a third-party open harness.
I don't ever believe these issues are technical. They're business decisions to downgrade performance, because fixing it means $$$$ and you aren't paying them enough.
Deja Vu... This looks just like the Claude Code performance regression back in April. I just quit my Claude subscription when that happened and went to Codex.
Now I'm kinda thinking of trying per token for both, using GLM 5.2 on Fireworks for most tasks, shelling out to the big boys only when needed. Not totally confident I'll break even though.
Re per token, I had the same reaction, but given that both labs benefit economically from moving customers to per-token consumption... I almost want to avoid this on principle. Even if it's not intentional, a lab benefitting from a degraded product is not something I want to accept or enable.
More now than ever (since original ChatGPT release), the OSS models and open harnesses (eg Pi) are looking mighty attractive.
Right? I also quit Claude Code and switched to Codex over that. Now I’m trying to figure out how I could make an extra $65,000 to never have to be concerned about this nonsense again. I know the economics of using open router etc…
But I’m reminded of ~2008 and the rise of “the cloud” as a marketing term that seemed to me to be a cover for dropping the expectation of rich clients and increasing a company's margins through subscriptions that would chip away at local ownership.
Then I got put off by the zealotry and absolutism around “true FOSS”, told myself I was young, and moved on.
And really, a lot of subscription models I can kind of appreciate/tolerate. They might be irksome, but whatever: I get that software is expensive to make, and it’s not fair in 2026 to value a yearly upgrade of Photoshop at $200. The capricious UI changes to things that’ve worked for 20 years, where they take away, say, the classic color swatches altogether: silly and dumb.
I can use another professionally necessary tool I pay $200/ mo for, Codex, to whip up a classic swatch plugin.
Is that $200 a fair price for my token usage? I think in an extremely heavy month I might’ve used a billion tokens?
But that right there is the problem. They have no idea what, specifically, profitability looks like and are going to be pulling endless levers for … I genuinely have no idea how long - at least through 2030/2032, if we read the tea leaves of their debt obligations?
I don’t want to think about any of that. At all. I don’t want to spend time evaluating model preference and degradation and updating the nuances of how I “speak” to an AI because there’s some mystery backend experiment running on the output I use to produce functional outputs — ie the actual products I get paid to build/ maintain.
AI’s something between a tool and a coworking companion, and the capricious “personality” changes that come from playing with poorly understood knobs and levers at the inference level are maddening. To that end, I want a box in the corner I can point to and know exactly the quality of its outputs, which no one but myself modifies.
Fireworks?
Provides access to AI models for a per-token fee. See OpenRouter, they are one of many.
The vibe-assumed Claude Code performance regression, yep. People should stop expecting consistent performance from non-deterministic systems. There is zero empirical corroboration of performance degradation.
There has been a step change... in the amount of whining and complaining coders exhibit lately.
If you bother to look at the issue instead of whining and complaining, you will see the evidence.
When I disagree with the data: I will nitpick every last detail of methodology, any cross-corroboration is an anecdote, suddenly I demand a-priori levels of justification. All science is flawed anyways, it's not like mathematics, you can't get absolute certainty, so why bother? You're always going to be making base assumptions that can be challenged, you're necessarily going to abstract out the territory, the map is flawed.
When I agree with the data: I will boast about the victories of science and empiricism, we found the perfect set of natural abstractions that are necessary and sufficient to map out the territory that carve at the joints of the problem, any concern about assumptions is rebutted with generic "Well, we're just pragmatists; we're not perfect, but clearly we're converging on the right direction! You're clearly someone who just wants to nitpick and not get any work done."
My experience with certain hackernews commenters in a nutshell.
This is evidence of a bug, not the purposeful enshittification people are referencing.
Maybe it's just bad memory, but I feel like 5.3 was the best version in terms of token usage and code quality. 5.5 works better, but it just eviscerates tokens.
They rendered 5.3 unusable for me a few weeks back. It simply was locking up or answering poorly.
I swear some days ago someone here claimed OpenAI succeeded in cutting their compute cost by half with a breakthrough optimization. So this is it?
That was an article in The Information, but it didn't read very well to me. I didn't get the impression the author was enough of a technical expert on how LLMs work to credibly evaluate the claim, which came from an insider rumor: https://www.theinformation.com/newsletters/ai-agenda/openai-...
> OpenAI engineers earlier this month told some colleagues they had figured out a way to more than halve the cost of inference, or running existing models, thanks to some newly-discovered optimizations, according to a person with knowledge of those discussions.
I bet that since this bug has made headlines, there are some panicked engineers at OpenAI desperately trying to figure out how to fix it without undoing their “magic optimisation”.
My understanding of the rumor is that it wasn't OpenAI itself, but one of the post-blip OpenAI breakaway groups (rumoured to be Thinking Machines) who have made a breakthrough and seem to be shopping it to OpenAI. I don't think it has actually been implemented by OpenAI yet.
A rare case of "they made the model dumber" where they actually made the model dumber, instead of the usual user psychosis?
Rather, it seems to be an inference engine or agent harness defect/misconfig. Not only do the issue details not evidence a willful stealth nerf, they actively suggest otherwise: the root cause is crude, and evidently not particularly stealthy (as it's being reported by a regular user with independently verifiable, exact details).
I don't find "usual user psychosis" particularly fair or tasteful anyhow. You're not left with much more than subjective judgement and speculation/suspicion when all you have is a magic sink of an API endpoint that ingests your context window then spits back a continuation of it. Even if you have a standardized model test suite, claiming a stealth nerf remains an exercise in mind reading (of the people working there). Model quality can degrade without an explicit intention that way, or a downgrade of the underlying infrastructure, after all.
Being tongue-in-cheek conspiratorial, or even actually entertaining the possibility of a nerf, is no psychosis anyway. I'm not a fan of this trend of people abusing psychological diagnostic terminology like this. I'm sure there are people who go a step beyond and are overconfident in these judgements; maybe in their case it holds. But that's a minority, so what you have then is hyperbole. It doesn't serve anyone.
"They made the model dumber" on literally the same checkpoint with the same prompt on the same quantization running on the same hardware is a staple of AI complaints.
Users are completely incapable of objectively evaluating model quality over time.
Which makes it all the harder to notice actual "stealth nerfs", misconfigurations or other technical issues. Because "they made the model DUMBER, for REAL this time" is background noise.
How are you so sure that frontier API models are always running the same quant/weights/etc? You think OpenAI and Anthropic are running essentially just vLLM endpoints? Of course not.
Firstly, we know Anthropic has been doing prompt injection into their 1P APIs (not bedrock/vertex AFAIK) for at least a year now. https://old.reddit.com/r/ClaudeAI/comments/1f6hcwo/injection...
This can be verified pretty quickly, as OP did: count the token metrics. If your context contains classifier-firing terms, you’ll see input_tokens come out higher than your actual input.
So if they’re already doing that, what makes you think it’s just a dumb API, instead of a complicated pipeline filled with trade secrets and optimisations?
Clearly they are batching reasoning inference in a few multiples of 512 tokens as a throughput optimization
Isn't the standard to use continuous batching? If they are using continuous batching -- I'm curious why generated token length matters, and why they might be clustering them. If not -- I'm curious why they aren't and what is the tradeoff here.
This "~512 batching" makes me think of things like diffusion or prefill.
If they managed to put together some dirty hack that lets them generate about 512 tokens worth of reasoning in parallel instead of in sequence? That would explain it.
Reset!
The good experience I had with GPT-5.5 before made me upgrade to Pro this month. Now I want a refund.
Sounds like a problem with the drafter.
tldr:
GPT-5.5 Codex model exhibits a clustering phenomenon in which reasoning_output_tokens cluster at fixed values spaced 518 apart.
These stuck responses at fixed thresholds are strongly correlated with errors in complex tasks.
Observed phenomenon is specific to GPT-5.5; it is much less prevalent in GPT-5.4 and almost absent in GPT-5.2 and 5.3
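If you collect your own reasoning_output_tokens values, a quick way to look for the clustering described in the tl;dr is sketched below. It assumes the cluster values sit at 516 plus whole multiples of 518; that is my reading of "516" from the thread combined with "spaced 518 apart", and the issue itself may define it differently. The names `cluster_index` and `summarize` and the sample counts in the usage line are hypothetical placeholders, not measurements.

```python
from collections import Counter

BASE = 516  # short-circuit value reported in the thread
STEP = 518  # spacing quoted in the tl;dr (assumed to start from BASE)


def cluster_index(reasoning_tokens, base=BASE, step=STEP):
    """Return k if the count equals base + k*step exactly, else None."""
    offset = reasoning_tokens - base
    if offset >= 0 and offset % step == 0:
        return offset // step
    return None


def summarize(counts):
    """Tally how many runs landed exactly on each assumed cluster value."""
    hits = Counter()
    for n in counts:
        k = cluster_index(n)
        if k is not None:
            hits[k] += 1
    return hits


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute your own runs.
    sample = [516, 516, 7421, 1034, 6890]
    print(summarize(sample))
```

Exact matches are used on purpose: a healthy run can land near a threshold by chance, but landing exactly on one over and over is the pattern the issue describes.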
I've been using it for a month, since they gave it to me for free, and I found GPT-5 on Codex quite weird/awful, even on x-high. Then I figured I should try OMP (Pi), and the experience was much better.
I remember GPT 5.2 Codex being fine...
Does this affect the Codex app too, or just the Codex CLI tool?
From some of the numbers I'm seeing in the GitHub issue, the codex desktop app has the same 516 spikes. So most likely it is affected.
Personally, I would say very likely, to be honest. I gotta go through this a little more, but I actually use 5.5 codex an obscene amount, and I almost never use it for reasoning anymore. It's not even in the same galaxy as far as actually taking out the thinking and using GPT-5.5 or even Claude and then coming back and giving it the reasoning. Blah blah blah, it's the same model. Well, let me tell you, no, it's not, for several reasons, and the delta on intelligence is pretty staggering.
Care to explain what you mean by that?
I'm struggling as well to understand, and I think perhaps they mean they use ChatGPT website with GPT-5.5+reasoning for problem solving, and paste the output into Codex CLI/App. I think they're saying that letting Codex CLI/App problem solve with GPT-5.5 isn't as effective. Essentially that the web harness is superior to the agentic engineering harness for problem solving?
Not sure if I agree, but I do happen to use a fair bit of web harness as well, just because I find it to be much more effective at web search and a different type of reasoning. So I must agree a little or else I wouldn't do that.
I assume they are lying and still think you can use gpt 5.5 non-codex within codex cli. And they outed themselves. A lot of nonsense. And the very poor communication skills just seem like the typical chinese astroturfing you see pretty often now when discussing OAI/Claude.
See, this is part of the confusion. There is no such thing as "GPT-5.5-codex". The last codex-branded model was "GPT-5.3-codex". Starting with "GPT-5.4" the main model handles agentic engineering and they did not release a coding model.
Both the web harness and codex app/cli use "GPT-5.5".
haha woops. guess im the chinaman now
What do you mean by that? Seems kinda racist.
I know that these types of comments are not really popular here, but this struck a chord with me because I feel the same. They aren't remotely close.
I have codex right now purely because they gave me a month free of ChatGPT Pro, so I have been using it in between my usage resets with claude. Since it's "free money" for me I have been using it exclusively on xHigh.
One of my most frequent prompts is "hey codex worked on ____, but it didn't quite hit the mark, can we review the work..."
Yes, part of this is normal even within the same model -- you have the highest-power model review the work for correctness, refactoring opportunities, and so on. But man, I tell you, I don't know what it is about codex. This is obviously one guy's anecdote -- same prompting style, same repository documentation à la MD files, same skills, way different results.
All that to say, maybe the bug report is on to something here, and it can be fixed.
What?