I'm still calibrating myself on the size of task that I can get Claude Code to do before I have to intervene.
I call this problem the "goldilocks" problem. The task has to be large enough that it outweighs the time necessary to write out a sufficiently detailed specification AND to review and fix the output. It has to be small enough that Claude doesn't get overwhelmed.
The issue is that writing a "sufficiently detailed specification" is task dependent. Sometimes a single sentence is enough, other times a paragraph or two, and sometimes a couple of pages is necessary. The "review and fix" phase is again totally task dependent and completely unknown up front. I can usually estimate the spec time, but the review and fix phase is a dice roll that depends on the output of the agent.
And the "overwhelming" metric is again not clear. Sometimes Claude Code can crush significant tasks in one shot. Other times it can get stuck or lost. I haven't yet developed an intuition for telling these cases apart.
What I can say is that this is an entirely new skill. It isn't like architecting large systems for human development. It isn't like programming. It is its own thing.
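The "goldilocks" trade-off above can be written down as a rough break-even check. This is only a sketch: the function names and the example probabilities and minutes are made up for illustration, and in practice the review numbers are exactly the part that is hard to estimate.

```python
def expected_review_minutes(outcomes: list[tuple[float, float]]) -> float:
    """Expected review-and-fix time from (probability, minutes) pairs.

    The pairs are guesses about how the "dice roll" might land, for example
    a clean one-shot versus a run that needs heavy fixing.
    """
    total_p = sum(p for p, _ in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * minutes for p, minutes in outcomes)


def worth_delegating(manual_min: float, spec_min: float, review_min: float) -> bool:
    """Delegating pays off only if doing it by hand costs more than spec + review."""
    return manual_min > spec_min + review_min


# Hypothetical numbers: 60% chance of a 10-minute review, 40% chance of 90 minutes.
review = expected_review_minutes([(0.6, 10), (0.4, 90)])
print(worth_delegating(manual_min=60, spec_min=15, review_min=review))
```

The expected value hides the variance, which is the real complaint here: two tasks with the same expected review time can feel very different if one of them occasionally blows up.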
"It isn't like programming. It is its own thing."
You articulated what I was wrestling with in the post perfectly.
This illustrates a fundamental truth of maintaining software with LLMs: While programmers can use LLMs to produce huge amounts of code in a short time, they still need to read and understand it. It is simply not possible to delegate understanding a huge codebase to an AI, at least not yet.
In my experience, the real "pain" of programming lies in forcing yourself to absorb a flood of information and connecting the dots. Writing code is, in many ways, like taking a walk: you engage in a cognitively light activity that lets ideas shuffle, settle, and mature in the background.
When LLMs write all the code for you, you lose that essential mental rest. The quiet moments where you internalize concepts, spot hidden bugs, and develop a mental map of the system.
100% yes. QA'ing a bunch of LLM generated code feels like a mental flood. Losing that mental rest is a great way to put it.
This should be called the eternal, unbearable slowness of code review, because the author writes that the AI actually churns out code extremely rapidly. The (hopefully capable, attentive, careful) human is the bottleneck here, as it should be.
If only code and application quality could be measured in LoC - middle managers everywhere would rejoice.
> ... I’ll keep pulling PRs locally, adding more git hooks to enforce code quality, and zooming through coding tasks—only to realize ChatGPT and Claude hallucinated library features and I now have to rip out Clerk and implement GitHub OAuth from scratch.
I don't get this, how many git hooks do you need to identify that Claude had hallucinated a library feature? Wouldn't a single hook running your tests identify that?
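For reference, a single test-running hook could look roughly like the sketch below, saved as an executable `.git/hooks/pre-commit`. Git aborts the commit when this hook exits non-zero. The `pytest` command is an assumption about the project; swap in whatever runs your tests. It only catches a hallucinated library feature if some test actually exercises the code that calls it.

```python
#!/usr/bin/env python3
"""Sketch of a pre-commit hook: block the commit if the tests fail."""
import subprocess
import sys

# Assumption: the project's tests run with pytest; substitute your own command.
TEST_COMMAND = [sys.executable, "-m", "pytest", "-q"]


def run_tests(command: list[str]) -> int:
    """Run the test command and return its exit code."""
    return subprocess.run(command).returncode


def main(command: list[str] = TEST_COMMAND) -> int:
    """Return the test exit code; non-zero makes git abort the commit."""
    code = run_tests(command)
    if code != 0:
        print("pre-commit: tests failed, commit aborted", file=sys.stderr)
    return code


# When installing this as the hook, end the file with:
#     sys.exit(main())
```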
They probably don't have any tests, or the tests that the LLM creates are flawed and don't detect these problems.
Yesterday Claude Code assured me the following:
• Good news! The code is compiling successfully (the errors shown are related to an existing macro issue, not our new code).
When in fact, it had managed to insert 10 compilation errors that were not at all related to any macros.
Just tell the AI "and make sure you don't add bugs or break anything"
Works every time
I tried using agents in Cursor and when it runs into issues it will just rip out the offending code :)
"hallucinated" library features are identified even earlier, when claude builds your project. i also don't get what author is talking about.
AI agents have been known to rip out mocks so that the tests pass.
I have had human devs do that too
Prompting it better during development can really help here.
I have an emerging workflow orchestrated by Claude Code custom commands and subagents that turns even an informal description of a feature into a full-fledged PRD, then an "architect" command researches and produces a well-thought-out and documented technical design. I can review that design document and then give it to the "planner" command, which breaks it down into Phases and Tasks. Then I have a "developer" command iterate through and implement the Phases one by one. After each phase it runs a detailed code review using my "review" subagent.
Since I started using this document-driven, guided workflow, I've seen the quality of the output noticeably improve.
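To make the shape of that workflow concrete, here is a minimal sketch of the control flow only. It is not the commenter's actual setup and does not call Claude Code: every stage is a plain Python callable standing in for a command or subagent, and the names `Phase`, `RunLog`, and `run_workflow` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Phase:
    name: str
    tasks: list[str]


@dataclass
class RunLog:
    events: list[str] = field(default_factory=list)


def run_workflow(
    description: str,
    write_prd: Callable[[str], str],
    architect: Callable[[str], str],
    approve_design: Callable[[str], bool],  # the human review step
    planner: Callable[[str], list[Phase]],
    developer: Callable[[Phase], str],
    review: Callable[[Phase, str], list[str]],
) -> RunLog:
    """Description -> PRD -> design -> human approval -> phases -> implement + review."""
    log = RunLog()
    prd = write_prd(description)
    log.events.append("prd")
    design = architect(prd)
    log.events.append("design")
    if not approve_design(design):
        # Nothing gets implemented until a person signs off on the design.
        log.events.append("design rejected")
        return log
    for phase in planner(design):
        output = developer(phase)
        findings = review(phase, output)
        log.events.append(f"{phase.name}: {len(findings)} review findings")
    return log
```

The point of the structure is that each stage consumes a document produced by the previous one, so there is a reviewable artifact at every step rather than one large prompt.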
Gemini CLI is pretty weak, but Gemini 2.5 Pro is still the best for long contexts. Claude is great, but it crumbles as you start to get into the 50-100k token range. I find Gemini doesn't start to crack until the 150-200k range. It's too bad the tooling around it is mediocre at best.
When building a project from scratch using AI, it can be tempting to give in to the vibe and ignore the structure/architecture and let it evolve naturally. This is a bad idea when humans do it, and it's also a bad idea when LLM agents do it. You have to be considering architecture, dataflow, etc from the beginning, and always stay on top of it without letting it drift.
I have tried READMEs scattered through the codebase but I still have trouble keeping the agent aware of the overall architecture we built.
Slow is smooth, smooth is fast.
AI tools seem excellent at getting through boilerplate stuff at the start of a project. But as time goes on and you have to think about what you are doing, it'll be faster to write it yourself than to convey it in natural language to an LLM. I don't see this as an issue with the tool, but just getting a better idea of what it is really good for.
The role of a software engineer is to condense the (often unclear) requirements, business domain knowledge, existing code (if any) and their skills/experience into a representation of the solution in a very concise language: a programming language.
Having to instead express all that (including the business-related part, since the agent has no context of that) in a verbose language (English) feels counter-productive, and is counter-productive in my experience.
I've successfully one-shotted easy self-contained, throwaway tasks ("make me a program that fills Redis with random keys and values" - Claude will one-shot that) but when it comes to working with complex existing codebases I've never seen the benefits - having to explain all the context to the agent and correcting its mistakes takes longer than just doing it myself (worse, it's unpredictable - I know roughly how long something will take, but it's impossible to tell in advance whether an agent will one-shot it successfully or require longer babysitting than just doing it manually from the beginning).
We are going to end up having boilerplate natural language text, that's been tested and proven to get the same output every time. Then we'll have a sort of transpiler and maybe a sub language of English, to make prompting easier. Then we will source control those prompts. What we actually do today, with extra steps.
I've found LLMs to be very good at writing design docs and finding problems in code.
Currently they're better at locating problems than fixing them without direction. Gemini seems smarter and better at architecture and best practices. Claude seems dumber but is more focused on getting things done.
The right solution is going to be a variety of tools and LLMs interacting with each other. But it's going to take real humans having real experience with LLMs to get there. It's not something that you can just dream up on paper and have it work out well since it depends so much on the details of the current models.
Somewhat related, I found Cursor/VS was slowing to the point of being unusable. Turning on privacy mode helped, but the main culprit was extremely verbose logging. Running `fatrace -c --command=cursor` revealed the issue.
The disk in question was an HDD and the problem disappeared (or is better hidden) after symlinking the log dir to an SSD.
As for code itself, I've never had an issue with slowness. If anything it's the verbosity of wanting to explain itself and excess logging in the code it creates.
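For anyone wanting to try the same workaround, moving a log directory to another disk and symlinking it back can be sketched like this. The comment above doesn't say where the log directory lives, so the paths in the usage line are placeholders, and `relocate_log_dir` is a made-up helper name. Close the editor before moving anything.

```python
import shutil
from pathlib import Path


def relocate_log_dir(log_dir: Path, target_parent: Path) -> Path:
    """Move log_dir under target_parent and leave a symlink at the old path."""
    log_dir = Path(log_dir)
    if log_dir.is_symlink():
        # Already relocated; report where it points.
        return log_dir.resolve()
    target = Path(target_parent) / log_dir.name
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    # shutil.move falls back to copy-and-delete when crossing filesystems.
    shutil.move(str(log_dir), str(target))
    log_dir.symlink_to(target, target_is_directory=True)
    return target


# Placeholder paths, not the real location of any editor's logs:
# relocate_log_dir(Path("/path/on/hdd/logs"), Path("/path/on/ssd"))
```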
I've never done QA. Just thinking about doing QA makes my head swirl. But yes, because of LLMs I am now a part-time QA engineer, and I think that it's kinda helping me be a better developer. I'm working on a massive feature at work, something I can't just give to an agent, and I already feel like something changed in how I think about every little piece of code I'm adding. Didn't see that coming.
Type safety, integration testing, and thorough READMEs are now cheap; I don't know why any developer would not be using them with Claude Code. Even if all the LLM services go under tomorrow, you'll still have code that practically autocompletes itself.
My employer hosts one of the largest Ruby on Rails apps in the world. I've noticed that Claude Code takes a long time to grep for what it needs. Cursor is much better at this (probably because of local project indexing). Due to this, I favor Cursor over CC in my day to day workflows. In smaller code bases, both are pretty fast.
Even if it's slow, you can run multiple agents. You can have one making changes, while another writes documentation, while another does security checks, while another looks for optimizations. Persist findings to markdown files to track progress and for cross-agent knowledge sharing if needed. And do whatever else while it's all running. This has been my experience.
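A minimal sketch of the "persist findings to markdown" part, assuming one notes file per agent so parallel agents never write to the same file. The directory layout and the helper names `record_finding` and `read_all_findings` are invented for illustration; in practice the agents would be told to write these files themselves.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_finding(notes_dir: Path, agent: str, finding: str) -> Path:
    """Append one finding as a bullet to that agent's own markdown file."""
    notes_dir.mkdir(parents=True, exist_ok=True)
    path = notes_dir / f"{agent}.md"
    is_new = not path.exists()
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with path.open("a", encoding="utf-8") as fh:
        if is_new:
            fh.write(f"# Findings: {agent}\n\n")
        fh.write(f"- [{stamp}] {finding}\n")
    return path


def read_all_findings(notes_dir: Path) -> dict[str, list[str]]:
    """Collect every agent's bullets so another agent (or you) can catch up."""
    findings: dict[str, list[str]] = {}
    for path in sorted(notes_dir.glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        findings[path.stem] = [ln[2:] for ln in lines if ln.startswith("- ")]
    return findings
```

Keeping a separate file per agent is a deliberate choice in this sketch: it sidesteps merge conflicts in the notes themselves, even if the code changes still have to be merged.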
But then you have to keep all those tasks in your head and be ready to jump into any of them.
The check-ins are much more frequent and the instructions much lower level than what you’d give to a team if you were running it.
Do you have an example of a large application you’ve released with this methodology that has real paying users that isn’t in the AI space?
OP says in 2nd paragraph that they are using multiple agents in parallel. In fact, that's what their app does.
If they are modifying the same code, then you have to merge all of the different changes, so it's not really parallel.
IME it's faster to not try to edit the same code in parallel because of the cost of merging.
Maybe I’ve misunderstood this, so correct me if I’m wrong… do actual professional developers let enough code be generated to include entire libraries that handle things as important as authentication, and then build on top of it without making sure the previously generated code actually does what it’s supposed to? Just accept local PRs written by AI, with a very sternly worded “now you better not make any bullshit” system prompt? All this just in time to ramp up AI penetration tools. Jesus.
It’s kind of crazy to me how the cool kid take on software development, as recently as 3 years ago, was: strictly-typed everything, ‘real men’ don’t use garbage collection, everything must be optimized to death even when it isn’t really necessary, etc. and now it seems to be ‘you don’t seriously expect me to look at ‘every single line of code’ I submit, do you?’
The mistake you’re making is assuming it’s the same group of people saying both things. The “strictly typed, no GC, optimize everything” crowd hasn’t suddenly turned into the “lol I don’t read my AI-generated PRs” crowd. Those are two different tribes of devs with completely different value systems.
What’s changed isn’t that the same engineers did a 180 on principles, it’s that the discourse got hijacked by a new set of people who think shipping fast with AI is cooler than sweating over type systems. The obsession with performance purity was always more of a niche cultural flex than a universal law, and now the flex du jour is “look how much I can outsource to the machine.”
I think people are missing the point of this technology. It likely helps with some trivial tasks for seasoned engineers, but the real upside is that it makes software creation more accessible for non-engineers. It democratizes creation in the same way that you no longer need a studio to produce a record or a soundstage to produce a movie.
The flip side is that nobody is making a new Avengers movie with just a laptop, but that was never the expectation. Blockbuster movies still benefit from technology both in making menial tasks easier and allowing the next generation of filmmakers an easier way to get started. Software benefits from AI in exactly the same way, which is to say, it can help but it won't be game changing for everyone, especially at the high end.
In the same way democratizing content creation means you get many more awful songs/YouTube videos, AI also means you get more terrible software, which can of course be dangerous. But on the whole, it's vastly preferable for more people to have the ability to create than fewer.
Well yeah, as the app scales it will bump up against context limits. Giving it sandboxed areas to do specific tasks will speed it up again, but that’s not possible with everything.
I split a large task into 4-5 small subtasks, each in a new conversation to save tokens, and it does a pretty good job.