Writing too much at once with under-specified prompts.
If you stick to targeted problems with well-described prompts, acceptance criteria, and lots of linting, unit testing, and integration testing, you'll typically get what you want with code that looks okay. And when things start to stray, it's easy to get things back on track.
It's when you start trying to have LLMs write too much without a human review that you start getting unnecessary function chains, abstractions that aren't needed, code that doesn't really match the existing style, duplicate code, missing functionality, hallucinated functionality, tautological tests, etc.
It works best when there's regular feedback in the loop about what's good and what's not good. Testing and linting can fill in some of that, but we still need a human in the loop with "taste", so to speak.
When I ask the LLM to try to solve a problem that turns out to be difficult or impossible, I've found it will absolutely lose the plot.
I feel like a human would give up a lot quicker and start to learn where the limits are. Claude spins in circles convinced it's finally found a solution. Again. And again. And eventually gets back to where it started.
It fails to port C++ and CUDA code to Python. In particular, this repo: https://github.com/weigao95/surfelwarp . I really thought that at the current level of Claude/Codex it would be easy.
I was working with Claude on a Chrome extension. The extension was getting a 429 "Too many requests" error on one website. Claude suggested a bunch of things to try, none of which really solved the problem and were kind of one-off attempts (hardcoded string compares, etc.).
Eventually I asked it "hey, are you sending two requests when you could send one?" Claude thought about it for a minute and said, "you're right! Let me fix that." The 429 errors stopped.
I've found it really is more like pair programming than having another fully independent developer. For Jenkins pipelines, I don't care about hardcoded string compares as much. For the core capability of the software, details are important.
Codebases that are too big for the context window and not properly isolated modules. It can't keep track of everything.
Also any situation where the context window is even remotely close to being full. At 80% full, the degradation is noticeable enough that I start from scratch.
Layers of abstraction. Most notably with inheritance and general OOP concepts. I've tried to force it, assuming it prefers a more functional or simple class style, but it genuinely struggles to generate (though not to understand) what I might call a typical system in an OOP paradigm with well-defined abstractions.
That depends on how many details I specify.
If I specify a lot, I usually get what I want. But in the extreme this is just another form of coding (high quality code is quite similar to a detailed spec).
In many cases I find I have to do many "passes" to get the right balance of correctness, performance, security, and clean architectural boundaries.
Having a loop to fix these often makes it worse since they can often be contradictory.
There are also some types of code that I believe are often wrong in the training data and are almost always wrong in the LLM output as well.
Typically anything that should have been a state machine, like auth flows, wizards, etc.
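To make the state-machine point concrete, here is a minimal sketch of what the explicit version of, say, a signup wizard looks like; the step names and transitions are illustrative assumptions, not anyone's real flow:

```python
from enum import Enum, auto

class Step(Enum):
    EMAIL = auto()
    VERIFY = auto()
    PROFILE = auto()
    DONE = auto()

# Legal transitions live in one table, instead of being scattered
# across boolean flags and if/else chains.
TRANSITIONS = {
    Step.EMAIL: {Step.VERIFY},
    Step.VERIFY: {Step.PROFILE, Step.EMAIL},  # back to EMAIL on a bad code
    Step.PROFILE: {Step.DONE},
    Step.DONE: set(),
}

class Wizard:
    def __init__(self):
        self.step = Step.EMAIL

    def advance(self, target: Step) -> None:
        if target not in TRANSITIONS[self.step]:
            raise ValueError(f"illegal transition {self.step} -> {target}")
        self.step = target

w = Wizard()
w.advance(Step.VERIFY)
w.advance(Step.EMAIL)  # verification failed, go back
```

The LLM-generated versions of these flows tend to encode the same logic as a pile of flag checks, which is exactly where the missing-transition bugs hide.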
When all is said and done I think the main savings come from the high throughput of low-value generic solutions. I don't currently see this changing, and the reason is that high quality products cannot be generated without specifying a lot of details. Of course, we may not want quality.
It doesn't push back enough when I ask it to implement something that is a bad idea. I've caught it breaking threading code and then 'fixing' it by putting an atomic in the wrong place: that might have 'fixed' it in the sense that a few racy tests can now run a thousand times, but it wasn't the right place, so the race still exists. It has also changed constants when moving them to a different file (the move was required, but the constants were not expected to change).
I have found it very helpful to ask AI to review the latest changes - it often finds serious problems in review of code it just wrote.
> that might have 'fixed' it in the sense that a few racy tests can now run a thousand times, but it wasn't the right place, so the race still exists.
One somewhat related thing I've noticed from all the major LLMs is a tendency to 'fix' multithreading race conditions by inserting delay calls of seemingly random amounts of milliseconds to 'give things time to settle'.
It kinda makes sense why they do it because I've seen similar human generated code over the years, but it doesn't instill a lot of confidence in unreviewed 'vibecoded' systems.
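A minimal illustration of that anti-pattern and its fix, assuming a simple writer/reader pair: the sleep version only makes the race rarer, while waiting on an explicit signal removes it.

```python
import threading

result = {}
ready = threading.Event()

def worker():
    result["value"] = 42
    ready.set()  # signal completion instead of hoping the reader waited long enough

# The LLM-style 'fix' is `time.sleep(0.05)` before reading `result`:
# still a race, just with different odds. The correct fix is to block
# until the writer signals.
t = threading.Thread(target=worker)
t.start()
ready.wait(timeout=5)
t.join()
assert result["value"] == 42
```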
I had a fun one where Opus 4.6 could not properly export a 3D model to a 3MF for multi-color 3D printing. Ultimately I ended up having it output each color individually and I just import them together into the slicer.
Respecting instructions around tool use, even on small codebases, where the tool isn't its favourite way of doing things.
For example, models repeatedly try not to use my Makefile's targets (build / run / test / test-file / check) and instead spend several cycles attempting to invoke the system tools directly and getting it wrong.
I've got to the point where I run most models with an extra PATH folder that overrides the typically-misused commands, tells the model off, and redirects it back (more effective and portable than hooks). But then sometimes the model reads the Makefile, discovers the workaround, and does the stupid thing anyway.
Most recently: Opus 4.6 screwed up using a common and well documented API (Qt), then when asked to debug the observed issues, blamed "a Qt bug" and wrote a whole layer on top of the API to work around the issue caused by its incorrect use of the API.
It did the above twice in a row around different parts of the API.
The thought that there are almost certainly devs out there merging Claude PRs without the skills or volition to push back on its screwups is not comfortable.
Recently working on implementing a "MS Teams for the terminal" (video conf, audio, chat, file sharing, recording, etc, the usual things you would expect). Linux and Mac, FBSD and others to follow. Prior to the 1M context window I found I had to restrict myself to specific functionality/areas of the codebase. I'd gotten lazy anyway so this was no bad thing. Reduced the "vibe" quotient of the AI coding.
Depends. At the purely technical "implementation" level, the only limits I've found are the ones involving outdated knowledge: libraries, platforms, availability of some things.
But one big limit is the DX. Their choice of DX is usually abysmal; ironically, just like an average dev's. They seem to lack an aesthetic instinct for code, so you have to point them hard in the right direction or provide a sample of the expected DX, and they'll still fight against it at every turn.
While understandable in a way, since they are trained on average code and most code will now be written by machines anyway, making the DX "less relevant", it's also a giant code smell, as bad DX tends to point towards bad internals and wrong decisions along the way.
So not really a technical limit - they swallow anything you throw at them, even the most complex cases - but more of an aesthetic limit in terms of taste.
Hard limit:
- Plenty of API hallucination happening on cutting-edge Spark (4.0.0+) functionality, especially PySpark. Spark bears some blame here for broken and incomplete documentation. It takes a human in the loop to realize that the documentation is misleading or wrong or missing.
Soft limit:
- API design. I’ve found that, unless specifically steered towards “good” API design (highly subjective), agents tend to just add another endpoint / function to satisfy the exact task at hand, with total disregard to how the rest of the API looks. (Pretty much exactly what a junior engineer would do…)
On this note, one thing I've found Codex to do is worry more than necessary about breaking changes for internal APIs. Maybe a bit more prompting would fix this, but I found even when iteratively implementing larger new features, it worries about breaking APIs that aren't used by anything but the new code yet.
That last part about it acting like a junior matches my experience very well. I'm using LLM's for refactoring, adding repetitive blocks of code, etc.
Unless I'm very clear at all times it will write code like the most annoying stubborn junior you've ever worked with. Nothing is sacred, everything can be abbreviated, shortened, made more confusing, made less readable, and concepts like readability or naming conventions are not even considered.
It also adds superfluous nonsense comments that don't explain the "why".
I think consistency when operating autonomously is still a challenge, which you could refer to as their limit. You need to do so much around them to keep the consistency AND a decent level of performance. It's like a savant with a short attention span.
I found Sonnet 4.5 struggled with a two-pointer interval merge (two sorted lists of things with start/stop timestamps), but Opus 4.5 managed. Then it took Opus 4.6 to make it a three-way or k-way merge. So it reminded me of the classic Simple Made Easy talk by Rich Hickey where he talks about braids. Sonnet couldn't track two twisted braids and Opus 4.6 managed a weave.
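For reference, the k-way version the commenter describes is short once you lean on a heap; a sketch, where the (start, stop)-tuple representation is my own assumption:

```python
import heapq

def merge_intervals(*lists):
    """k-way merge of sorted (start, stop) interval lists, coalescing overlaps."""
    merged = []
    # heapq.merge streams all k sorted inputs in globally sorted order.
    for start, stop in heapq.merge(*lists):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the last merged interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

a = [(0, 2), (5, 8)]
b = [(1, 3), (9, 10)]
c = [(2, 4)]
print(merge_intervals(a, b, c))  # [(0, 4), (5, 8), (9, 10)]
```

Two-pointer merging is the part that braids the streams together; delegating that to the heap is what keeps the k-way case from getting tangled.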
But the last few weeks Opus 4.6 seems to have got dumb again. Now it is making way more mistakes and forgetting useful things and recent context it used to manage.
I am guessing this is just Anthropic quietly dialling down the real effort as they either downsize or free up compute for someone or something else.
> But the last few weeks Opus 4.6 seems to have got dumb again. Now it is making way more mistakes and forgetting useful things and recent context it used to manage.
are you logging access patterns and times to see when the degradation occurs?
I find that AI models are very bad at doing performance work because they keep guessing how their changes affect things or not really understanding how the profiler results work, leading to them going in circles and taking forever. I have noticed this effect in surprisingly few lines of code (hundreds).
One thing I've found super helpful for this is converting profiling results to Markdown and feeding them back into the agent in a loop. I've done it with a bit of manual orchestration, but it could probably be automated pretty well. Specifically, pprof-rs[0] and pprof-to-md[1] have worked pretty well for me, YMMV.
[0] https://github.com/tikv/pprof-rs
[1] https://github.com/platformatic/pprof-to-md
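pprof-rs and pprof-to-md are Rust-side tools; the same idea with Python's standard-library cProfile/pstats might look like the sketch below (the table format is my own assumption):

```python
import cProfile
import pstats

def profile_to_markdown(fn, top=10):
    """Run fn under cProfile and return the hottest functions as a Markdown table."""
    prof = cProfile.Profile()
    prof.runcall(fn)
    stats = pstats.Stats(prof)
    # stats.stats maps (file, line, name) -> (ccalls, ncalls, tottime, cumtime, callers)
    rows = sorted(stats.stats.items(), key=lambda kv: kv[1][3], reverse=True)[:top]
    lines = ["| function | calls | cumtime (s) |", "|---|---|---|"]
    for (filename, lineno, name), (cc, nc, tt, ct, callers) in rows:
        lines.append(f"| {name} ({filename}:{lineno}) | {nc} | {ct:.4f} |")
    return "\n".join(lines)

def busy():
    return sum(i * i for i in range(100_000))

print(profile_to_markdown(busy))
```

The Markdown output is the point: agents seem to attend to a small labelled table far better than to raw profiler dumps.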
Problems that require deep knowledge of multiple repositories, e.g. when trying to debug issues involving dependencies. The models get confused very fast even with all code available locally, due to the size of the problem. But in my experience any kind of deep integration already messes up the models, even within a single repo.
Pretty much anything that isn't Python, (obj)C(++)(#), Java(script), or other somewhat popular languages like Swift, Go, and Rust. They can do other languages, but things get rough for anything complex.
It's just really hard for them to write non-verbose code. I don't know if this is incentives from the providers to generate more tokens, but even with guidance on compact code, simple, etc, they just can't really do it right now.
Isn't this just an artifact of their chain-of-thought reasoning? If they are verbose in their output, it's more likely the next word predicted is actually correct.
Also, I see this more as a feature than a bug. In many projects I inherited in the past, I wish the original devs had been a bit more verbose. Then again, with every developer using LLMs now, probably the opposite applies.

At least part of that in my experience seems to be a desire to cover a number of edge cases that may not be practically relevant.
I haven't found their limit, but I have found my limits, of waiting for them to do their thing...
I tend to re-enjoy handrolling code more. I delegate the stuff that annoys me.
We have a large multiplatform codebase, the issue seems to be more the time it takes to navigate the code and reason about it, rather than the size. Arguably the size is causing 'them' to be slower in that regards, but I haven't found the limit yet. And with compaction, it's even less of a problem.
I would push back on that question a little, because it has a baked-in assumption about how these things work that conflicts with my mental model and experience with them.
The reason is that sometimes it spits out something or does a workflow that's pretty sophisticated, and sometimes it fails spectacularly in the most basic ways.
I don't think there is a complexity or domain knowledge limit as there would be with a human. Or at least not in the same sense. As long as it can repeat and remix patterns that it is trained on, then it will do its thing well. The same seems to be true for "reasoning" loops and workflows. It can spit out code that has been done N times before in a similar manner for a large N.
They can still break down because of very trivial issues and assumptions that happen to be baked in, go off the rails, and get stuck in long loops that are completely insane if you think of them as imitating human programmers.
When I use an agent, I always interview it first about the task. Ask how they would go about it, probe them, give them info that they lack.
Never go from prompt to action. Have them define their approach first, then split the approach into pieces, from gathering data to cleaning it up and so on. If applicable, front-load work that can be achieved with scripts, so you have testable and repeatable steps rather than let it go wild.
So the TL;DR is: I think the limitation is simply that it's a non-deterministic token machine that produces useful results enough of the time that it appears to be reasonable.
Quite obvious once you do anything that requires a modicum of logic or coordination (e.g. working with mutexes, concurrency, etc.).
I hate the tendency to make things up, and I don't mean hallucinations.
I once had Claude Code write a Python script that emulated the output of my training script, including pretending that the loss was decreasing. Why? Because it was unable to install a Python dependency.
Every time I use a coding agent, I need to double-check that it's not cutting corners, hard-coding things that shouldn't be, or straight up rewriting failing test cases. What I need more is honesty.
Their "this isn't working as expected" failure mode looks very familiar, from thousands of hours of dealing with outdated (and often just bad) internet recommendations over decades.
Last night the chatbot was unable to make a thing work, so it ended up spiraling and doing things like ripping all authentication out of a service. Then it just waved it off as, "it's just a dev service it'll be fine." I didn't tell it to rip all the authentication out, it just got frustrated. Yeah, I'm anthropomorphizing, but that's essentially how it went.
When the bot says, "I'm ready to throw something. Just rip it out and I'll deal with it another day." I can't help but sympathize.
They don't understand esoteric areas of computer science very well at all.
I had a mistake in which a large backup-file deletion event happened during a robocopy. 600 GB of files got 'deleted' (file headers toast, etc.). Trying to get the LLMs to understand the hunt parameters, what to focus on, what not to focus on: none of them could come reasonably close to doing file content recovery properly. I needed to build a custom solution because the available industry options couldn't do what was required, and the LLMs were useless for that (including the latest versions of Claude, Gemini and GPT). They just went around in circles, capped by their apparently weak knowledge of file recovery as a field. That is, creativity was their limitation.

My spin on this is that if there isn't a large corpus of code addressing the problem, then the LLM will do poorly for want of training data.
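For context, one building block that recovery tools (and custom ones like the commenter's) rely on is signature-based carving: scanning raw bytes for known magic numbers. A toy sketch, with an abbreviated signature table; real recovery also needs footer detection and length heuristics:

```python
# Well-known file-type magic bytes (abbreviated table for illustration).
SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
}

def scan_image(data: bytes):
    """Return sorted (offset, kind) candidates found in a raw byte image."""
    hits = []
    for magic, kind in SIGNATURES.items():
        offset = data.find(magic)
        while offset != -1:
            hits.append((offset, kind))
            offset = data.find(magic, offset + 1)
    return sorted(hits)

blob = b"junk" + b"\x89PNG\r\n\x1a\n" + b"..." + b"%PDF-1.7"
print(scan_image(blob))  # [(4, 'png'), (15, 'pdf')]
```

The hard part the LLMs reportedly failed at is everything after this step: deciding which candidates are real, where each file ends, and what to do when the headers themselves are damaged.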
- Dendritic Nix (either too new or too underrepresented in training data)
- Proper escaping of layered syntaxes in Ansible on the first attempt
- Writing bare-metal embedded Rust, although this was long ago, so not current models