"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
They’re not dissimilar to human devs, who also often feel the need to replat, refactor, over-generalize, etc.
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
A senior programmer does not suggest adding more complexity/abstraction layers just to say something. An LLM absolutely does, every single time in my experience.
You might not, but every "senior" programmer I have met on my journey has provided bad answers like the LLMs - and because of them I have an inbuilt verifier that means I check what's being proposed (by "seniors" or LLMs)
There are however human developers that have built enough general and project-specific expertise to be able to answer these open-ended, poorly-defined requests. In fact, given how often that happens, maybe that’s at the core of what we’re being paid for.
I have to be honest, I've heard of these famed "10x" developers, but when I come close to one I only ever find "hacks" with a brittle understanding of a single architecture.
This is something I experienced first hand a few weeks ago when I first used Claude. I have this recursive-decent-based parser library I haven't touched in a few years that I want to continue developing but always procrastinate on. It has always been kinda slow so I wanted to see if Claude could improve the speed. It made very reasonable suggestions, the main one being caching parsing rules based on the leading token kind. It made code that looked fine and didn't break tests, but when I did a simple timed looped performance comparison, Claude's changes were slightly slower. Digging through the code, I discovered I already was caching rules in a similar way and forgot about it, so the slight performance loss was from doing this twice.
Caching sounds fine, and it is a very potent method. Nevertheless, I avoid using it until I have almost no other options left, and no good ones. You now have to manage that cache, introduce a potential for hard to debug and rare runtime timing errors, and add a lot of complexity. For me, adding caching should come at the end, when the whole project is finished, you exhausted all your architecture options, and you still need more speed. And I'll add some big warnings, and pray I don't run into too many new issues introduced by caching.
It's better for things that are well isolated and definitely completely "inside the box" with no apparent way for the effects to have an effect outside the module, but you never know when you overlook something, or when some later refactoring leads to the originally sane and clean assumptions to be made invalid without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but to anticipate that anything that can go wrong might actually indeed go wrong, if I leave enough opportunities (complexity) even in right now well-encapsulated modules.
I mean, it's like having more than one database and you have to use both and keep them in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs, CPUs, to the OS, and it's complex enough already, and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity, I need it nice and simple. Well, as nice and simple as it gets these days, we seem to be moving towards biological system level complexity in larger IT systems.
If you are not writing the end-system but a library, there is also the possibility that the actual system will do its own caching on a higher level. I would carefully evaluate if there is really a need to do any caching inside my library? Depending on how it is used, the higher level doing it too would likely make that obsolete because the library functions will not be called as often as predicted in the first place.
There is also that you need a very different focus and mindset for the caching code, compared to the code doing the actual work. For caching, you look at very different things than what you think about for the algorithm. For the app you think on a higher level, how to get work done, and for caching you go down into the oily and dirty gear boxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but it is hard to avoid and the result is messy, very different kinds of code, dealing with very different problems, close together or even intertwined.
One weird trick is to tell the LLM to ask you questions about anything that’s unclear at this point. I tell it eg to ask up to 10 questions. Often I do multiple rounds of these Q&A and I‘m always surprised at the quality of the questions (w/ Opus). Getting better results that way, just because it reduces the degrees of freedom in which the agent can go off in a totally wrong direction.
This is more or less what the "architect" mode is in KiloCode. It does all the planning and documentation, and then has to be switched to "Code" in order to author any of it. It allows me to ensure we're on the same page, more or less, with intentions and scope before giving it access to writing anything.
It consumes ~30-40% of the tokens associated with a project, in my experience, but they seem to be used in a more productive way long-term, as it doesn't need to rehash anything later on if it got covered in planning. That said, I don't pay too close attention to my consumption, as I found that QwenCoder 30B will run on my home desktop PC (48GB RAM/12GB vRAM) in a way that's plenty functional and accomplishes my goals (albeit a little slower than Copilot on most tasks).
Workflow improvement: Use a repo bundler to make a single file and drop your entire codebase in gemini or chatgpt. Their whole codebase comprehension is great and you can chat for a long time without the api cost. You can even get them to comment on each other's feedback, it's great.
I can get useful info from gemini on 120k loc projects with repomix ignoring a few select files. If you're in the enterprise world obviously it's a different thing.
This is a little anthropomorphic. The faster option is to tell it to give you the full content of an ideal context for what you’re doing and adjust or expand as necessary. Less back and forth.
It’s not though, one of the key gaps right now is that people do not provide enough direction on the tradeoffs they want to make. Generally LLMs will not ask you about them, they will just go off and build. But if you have them ask, they will often come back with important questions about things you did not specify.
This is the correct answer. I like to go one step further than the root comment:
Nearly all of my "agents" are required to ask at least three clarifying questions before they're allowed to do anything (code, write a PRD, write an email newsletter, etc)
Force it to ask one at a time and it's event better, though not as step-function VS if it went off your initial ask.
I think the reason is exactly what you state @7thpower: it takes a lot of thinking to really provide enough context and direction to an LLM, especially (in my opinion) because they're so cheap and require no social capital cost (vs asking a colleague / employee—where if you have them work for a week just to throw away all their work it's a very non-zero cost).
Is it? By far the majority of code the LLMs are trained on is going to be from Git repositories. So the idea that stack overflow question and answer sections with buggy code is dominating the training sets seems unlikely. Perhaps I'm misunderstanding?
The post wasn't saying that StackOverflow Q&A sections with buggy code dominate the training sets. They're saying that despite whatever amount of code in there from Git repositories, the process of generating and debugging code cannot be found in the static code that exists in Github repos; that process is instead encoded in the conversations on SO, git hub issues, various forums, etc. So if you want to start from buggy code and go to correct code in the way the LLM was trained, you would do that by simulating the back and forth found in a SO question, so that when the LLM is asked for the next step, it can rely on its training.
Thanks! Okay, I agree it's an interesting concept, but I'm not sure if it's actually true, but I can see why it might be. I appreciate your clarification!
I took gp to be a complaint that you had to sort of go through this buggy code loop over and over because of how the LLM was trained. Maybe I read sarcasm at the end if the post when there was none.
Define heavy... There's a band where the max subscription makes most sense. Thread here talks $1000/month, the plan beats that. But there's a larger area beyond that where you're back to having to use API or buy credits.
A full day of Opus 4.1 or GPT 5 high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn the max monthly limits and then stop you or cost $1500 in top up credits for a 15 hour day. Wait, WTF, that's $300k/year! OK, while true, misses that that's accomplishing 6 - 8 in parallel, all day, with no drop in efficacy.
At enterprise procurement cost rates, hiring a {{specific_tech}} expert can run $240/hr or $3500/day and is (a) less knowledgable on the 3+ year old tech the enterprise is using, (b) wants to advise instead of type.
So the question then isn't what it costs, it's what's the cost of being blocked and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max for a dev that doesn't believe in using it?
TL;DR: At the team level, for guided experts and disbelievers, API likely ends up cheaper again.
Am I alone in spending $1k+/month on tokens? It feels like the most useful dollars i've ever spent in my life. The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a a year or two ago.
Not OP but I know it can be difficult to really difficult to measure or communicate this to people who aren't familiar with the codebase or the problem being solved.
Other than just dumping 10M tokens of chats into a gist and say read through everything I said back and forth with claude for a week.
But, I think I've got the start of a useful summary format: it that takes every prompt and points to the corresponding code commit produced by ai + adds a line diff amount and summary of the task. Check it out below.
I’m unclear how you’re hitting $1k/mo in personal usage. GitHub Copilot charges $0.04 per task with a frontier model in agent mode - and it’s considered expensive. That’s 850 coding tasks per day for $1k/mo, or around 1 per minute in a 16hr day.
I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.
You don't audit and review all $1k worth of tokens!
The AI might write ten versions. Versions 1-9 don't compile, but it automatically makes changes and gets further each time. Version 10 actually builds and seems to pass your test suite. That is the version you review!
—and you might not review the whole thing! 20 lines in, you realize the AI has taken a stupid approach that will obviously break, so you stop reading and tell the AI it messed up. This triggers another ~5 rounds of producing code before something compiles, which you can then review, hopefully in full this time if it did a good job.
I can easily hit the daily usage limits on Claude Code or Openai Codex by asking for more complex tasks to be done which often take relatively little time to review.
There's a lot of tokens used up quickly for those tools to query the code base, documentation, try changes, run commands, re-run commands to call tools correctly, fix errors, etc.
At any rate, I could easily go through that much with Opus because it’s expensive and often I’m loading the context window to do discovery, this may include not only parts of a codebase but also large schemas along with samples of inputs and outputs.
When I’m done with that, I spend a bunch of turns defining exactly what I want.
Now that MCP tools work well, there is also a ton of back and forth that happens there (this is time efficient, not cost efficient). It all adds up.
I have Claude code max which helps, but one of the reasons it’s so cheap is all of the truncation it does, so I have a different tool I use that lets me feed in exactly the parts of a codebase that I want to, which can be incredibly expensive.
This is all before the expenses associated with testing and evals.
I’m currently consulting, a lot of the code is ultimately written by me, and everything gets validated by me (if the LLM tells me about how something works, I don’t just take its word for it, I go look myself), but a lot of the work for me happens before any code is actually written.
My ability (usually clarity of mind and patience) to review an LLMs output is still a gating factor, but the costs can add up quickly.
I use it all the time. I am not into claude code style agentic coding. More of the "change the relevant lines and let me review" type.
I work in web dev, with vs code I can easily select a line of code that's wrong which I know how to fix but honestly tired to type, press Ctrl+I and tell it to fix. I know the fix, I can easily review it.
Gpt 4.1 agent mode is unlimited in the pro tier. It's half the cost of claude, gemini, and chatgpt. The vs code integration alone is worth it.
Now that is not the kind of AI does everything coding these companies are marketing and want you to do, I treat it like an assistant almost. For me it's perfect.
I trust Copilot way more than any agentic coder. First time I used Claude it went through my working codebase and tried to tell me it was broken in all these places it wasn't. It suggested all these wrong changes that if applied would have ruined my life. Given that first impression, it's going to take a lot to convince me agentic coding is a worthwhile tool. So I prefer Copilot because it's a much more conservative approach to adding AI to my workflow.
I would if there were any positive ROI for these $12k/year, or if it were a small enough fraction of my income. For me, neither are true, so I don’t :).
Like the siblings I would be interested in having your perspective on what kind of thing you do with so many tokens.
If freelancing and if I am doing 2x as much as previously with same time, it would make sense that I am able to make 2x as much. But honestly to me with many projects I feel like I was able to scale my output far more than 2x. It is a different story of course if you have main job only. But I have been doing main job and freelancing on the side forever now.
I do freelancing mostly for fun though, picking projects I like, not directly for the money, but this is where I definitely see multiples of difference on what you can charge.
You're not alone in using $1k+/month in tokens. But if you are spending that much, you should definitely be on something like Anthropic's Max plan instead of going full API, since it is a fraction of the cost.
I would personally never. Do I want to spend all my time reviewing AI code instead of writing? Not really. I also don't like having a worse mental model of the software.
What kind of software are you building that you couldn't before?
> One of the weird things I found out about agents is that they actually give up on fixing test failures and just disable tests. They’ll try once or twice and then give up.
Its important to not think in terms of generalities like this. How they approach this depends on your tests framework, and even on the language you use. If disabling tests is easy and common in that language / framework, its more likely to do it.
For testing a cli, i currently use run_tests.sh and never once has it tried to disable a test. Though that can be its own problem when it hits 1 it can't debug.
# run_tests.sh
# Handle multiple script arguments or default to all .sh files
Another tip. For a specific tasks don't bother with "please read file x.md", Claude Code (and others) accept the @file syntax which puts that into context right away.
Some of these sample prompts in this blog post are extremely verbose:
If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.
I have better luck being more concise and avoiding anthropomorphizing. Something like:
"validate documentation against existing code before implementation"
I have the best luck with RFC speak. “You MUST validate that the documentation validates existing code before implementation. You MAY update documentation to correct any mismatches.”
But I also use more casual style when investigating. “See what you think about the existing inheritance model, propose any improvements that will make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that’s over complicating things”
(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)
Also, there's a big difference between giving general "always on" context (as in agents.md) for vibe coding - like "validate against existing code" etc - versus bouncing ideas in a chat session like your example, where you don't necessarily have a specific approach in mind and burning a few extra tokens for a one off query is no big deal.
Context isn't free (either literally or in terms of processing time) and there's definitely a balance to be found for a given task.
I've had both experiences. On some projects concise instructions seem to work better. Other times, the LLM seems to benefit from verbosity.
This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.
Not my experience at all. I find that the shorter my prompts, the more garbage the results. But if I give it a lot of detail and elaborate on my thought process, it performs very well, and often one-shots the solution.
> If you are a heavy user, you should use pay-as-you go pricing; TANSTAAFL.
This is very very wrong. Anthropic's Max plan is like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate-limits, Claude Code can roll-over into you paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.
To be fair, allocating some token for planning (recursively) helps a lot. It requires more hands on work, but produce much better results. Clarifying the tasks and breaking them down is very helpful too. Just you end up spending lots of time on it. On the bright side, Qwen3 30B is quite decent, and best of all "free".
> Finally it occurred to me to put context where it was needed - directly in the test files.
Probably CLAUDE.md is a better place?
> Too much context
Claude’s Sub-agents[1] seems to be a promising way of getting around this, though I haven’t had time to play with the feature too much. Eg when you need to take a context-busting action like debugging dependencies, instead spin up a new agent to read the output and summarize. Then your top-level context doesn’t get polluted.
I've added a subagent to read the "memory_bank" files for project context after being told the task at hand, and summarize only the pertinent parts for the main agent. This is working well to keep the context focused.
I wrote three sub agents this week, one to run unit tests, another to run playwright and a third to write playwright. These are pretty basic boundaries that aren't hard to share context between agent and the orchestrating main agent. It seemed to help a lot. I also have complex ways to run tests (docker, data seeding, auth) and previously it was getting lost. Only compacted a couple times. Was a big improvement.
As a human dev, can I humbly ask you to separate out your LLM "readme" from your human README.md? If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere. If you're putting copy in your codebase that's instructions for a bot, that isn't a README.md. By all means come up with a new convention e.g. BOTS.md for this. As a human dev I know I can safely ignore such a file unless I am working with a bot.
I think things are moving towards using AGENTS.md files: https://agents.md/ . I’d like something like this to become the consensus for most commonly used tools at some point.
> If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere.
While I can understand why someone might develop that first-impression, it's never been safe to assume, especially as one starts working with larger projects or at larger organizations. It's not that unusual for essential sections of the same big project to have their own make-files, specialized utility scripts, tweaks to auto-formatter, etc.
In other cases things are together in a repo for reasons of coordination: Consider frontend/backend code which runs with different languages on different computers, with separate READMEs etc. They may share very little in terms of their build instructions, but you want corresponding changes on each end of their API to remain in lockstep.
Another example: One of my employer's projects has special GNU gettext files for translation and internationalization. These exist in a subdirectory with its own documentation and support scripts, but it absolutely needs to stay within the application that is using it for string-conversions.
This lines up with my own experience of learning how to succeed with LLMs. What really makes them work isn't so different from what leads to success in any setting: being careful up front, measuring twice and cutting once.
I’ve seen going very successfully using both codex with gpt5 and claude code with opus. You develop a solution with one, then validate it with the other. I’ve fixed many bugs by passing the context between them saying something like: “my other colleague suggested that…”.
Bonus thing: I’ve started using symlinks on CLAUDE.md files pointing at AGENTS.md, now I don’t even have to maintain two different context files.
I spent much of the last several months using LLM agents to create software. I've written two blog posts about my experience; this is the second post that includes all the things I've learned along the way to get better results, or at least waste less money.
I perfer building and using software that is robust, heavily tested and thoroughly reviewed by highly experienced software engineers who understand the code, can detect bugs and can explain what each line of code they write does.
Today, we are now in the phase where embracing mediocre LLM generated code over heavily tested / scrutinized code is now encoraged in this industry - because of the hype of 'vibe coding'.
If you can't even begin to explain the code or point out any bugs generated by LLMs or even off-load architectural decisions to them, you're going to have a big problem in explaining that in code review situations or even in a professional pair-programming scenario.
It's a funny comic, but can you actually give an example of what it's talking about? "Properly reviewed" can be construed as "has been working for a long time for a lot of people", which definitely can't be said about any AI process or any AI generated code. At the very least, 1 human person actually sat down and wrote the tools the comic is poking fun at. But with AI, we are currently producing code that was neither peer reviewed nor written (a process which includes revision) -- it was instead "generated". So it's still a step backwards.
> I perfer building and using software that is robust, heavily tested and thoroughly reviewed by highly experienced software engineers who understand the code, can detect bugs and can explain what each line of code they write does.
that's amazing. by that logic you probably use like one or two pieces of software max. no windows, macos or gnome for you.
IMO, a key passage that's buried:
"You can ask the agent for advice on ways to improve your application, but be really careful; it loves to “improve” things, and is quick to suggest adding abstraction layers, etc. Every single idea it gives you will seem valid, and most of them will seem like things that you should really consider doing. RESIST THE URGE..."
A thousand times this. LLMs love to over-engineer things. I often wonder how much of this is attributable to the training data...
They’re not dissimilar to human devs, who also often feel the need to replat, refactor, over-generalize, etc.
The key thing in both cases, human and AI, is to be super clear about goals. Don’t say “how can this be improved”, say “what can we do to improve maintainability without major architectural changes” or “what changes would be required to scale to 100x volume” or whatever.
Open-ended, poorly-defined asks are bad news in any planning/execution based project.
A senior programmer does not suggest adding more complexity/abstraction layers just to say something. An LLM absolutely does, every single time in my experience.
You might not, but every "senior" programmer I have met on my journey has provided bad answers like the LLMs - and because of them I have an inbuilt verifier that means I check what's being proposed (by "seniors" or LLMs)
There are however human developers that have built enough general and project-specific expertise to be able to answer these open-ended, poorly-defined requests. In fact, given how often that happens, maybe that’s at the core of what we’re being paid for.
I have to be honest, I've heard of these famed "10x" developers, but when I come close to one I only ever find "hacks" with a brittle understanding of a single architecture.
Most definitely, asking the LLM those things is the same as asking (people) on Reddit, Stack Overflow, IRC, or even Hacker News
This is something I experienced first-hand a few weeks ago when I first used Claude. I have a recursive-descent parser library I haven't touched in a few years that I want to continue developing but always procrastinate on. It has always been kind of slow, so I wanted to see if Claude could improve the speed. It made very reasonable suggestions, the main one being to cache parsing rules based on the leading token kind. The code it produced looked fine and didn't break tests, but when I did a simple timed-loop performance comparison, Claude's changes were slightly slower. Digging through the code, I discovered I was already caching rules in a similar way and had forgotten about it, so the slight performance loss came from doing it twice.
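For anyone picturing the change: caching rules by leading token kind is roughly this kind of lookup table (a minimal sketch; the Rule type and field names here are made up, not the library's actual API):

    from collections import defaultdict
    from dataclasses import dataclass

    @dataclass
    class Rule:                  # hypothetical stand-in for the library's parse-rule type
        leading_kind: str        # token kind that can start this rule
        name: str

    class Parser:
        def __init__(self, rules):
            # Build the cache once: leading token kind -> candidate rules
            self._rules_by_kind = defaultdict(list)
            for rule in rules:
                self._rules_by_kind[rule.leading_kind].append(rule)

        def candidates(self, token_kind: str):
            # O(1) lookup instead of scanning every rule on each parse step
            return self._rules_by_kind.get(token_kind, [])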
Caching sounds fine, and it is a very potent technique. Nevertheless, I avoid it until I have almost no other options left, and no good ones. You now have to manage that cache, you introduce the potential for rare, hard-to-debug runtime timing errors, and you add a lot of complexity. For me, caching should come at the end, when the whole project is finished, you have exhausted your architectural options, and you still need more speed. And I'll add some big warnings and pray I don't run into too many new issues introduced by the cache.
It's better suited to things that are well isolated and definitely "inside the box", with no apparent way for effects to leak outside the module. But you never know when you've overlooked something, or when some later refactoring invalidates the originally sane and clean assumptions without anyone noticing, because whoever does the refactoring only looks at a sub-section of the code. So it is not just a question of getting it right for the current system, but of anticipating that anything that can go wrong eventually will, if I leave enough opportunities (complexity) lying around, even in modules that are well encapsulated right now.
I mean, it's like having more than one database where you have to use both and keep them in sync. Who does that voluntarily? There's already caching inside many of the lower levels, from SSDs and CPUs to the OS; it's complex enough already and can lead to unexpected behavior. Adding even more of that in the app itself does not appeal to me, if I can help it. I'm just way too stupid for all this complexity; I need it nice and simple. Well, as nice and simple as it gets these days, since we seem to be moving towards biological levels of complexity in larger IT systems.
If you are not writing the end system but a library, there is also the possibility that the actual system will do its own caching at a higher level. I would carefully evaluate whether there is really a need to do any caching inside my library. Depending on how it is used, the higher level doing it too would likely make that obsolete, because the library functions will not be called as often as predicted in the first place.
There is also the fact that you need a very different focus and mindset for the caching code than for the code doing the actual work. For caching, you look at very different things than what you think about for the algorithm: for the app you think on a higher level, about how to get the work done, while for caching you go down into the oily, dirty gearboxes of the machine and check all the shafts and gears and connections. Ideally caching would not be part of the business code at all, but that is hard to avoid, and the result is messy: very different kinds of code, dealing with very different problems, sitting close together or even intertwined.
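For what it's worth, the "keep caching out of the business code" idea can be as small as wrapping the slow function at the edge (sketch only; expensive_lookup is a made-up stand-in):

    from functools import lru_cache

    def expensive_lookup(key: str) -> str:
        # Business logic only; it knows nothing about caching.
        return key.upper()       # stand-in for a slow parse/query/computation

    # Caching is bolted on in one place, at the edge,
    # and can be deleted without touching the logic above.
    cached_lookup = lru_cache(maxsize=1024)(expensive_lookup)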
> I often wonder how much of this is attributable to the training data...
I'd reckon anywhere between 99.9%-100%. Give or take.
One weird trick is to tell the LLM to ask you questions about anything that's unclear at this point. I tell it, e.g., to ask up to 10 questions. Often I do multiple rounds of these Q&As, and I'm always surprised at the quality of the questions (with Opus). I get better results that way, just because it reduces the degrees of freedom in which the agent can go off in a totally wrong direction.
This is more or less what the "architect" mode is in KiloCode. It does all the planning and documentation, and then has to be switched to "Code" in order to author any of it. It allows me to ensure we're on the same page, more or less, with intentions and scope before giving it access to writing anything.
It consumes ~30-40% of the tokens associated with a project, in my experience, but they seem to be used in a more productive way long-term, as it doesn't need to rehash anything later on if it got covered in planning. That said, I don't pay too close attention to my consumption, as I found that QwenCoder 30B will run on my home desktop PC (48GB RAM/12GB vRAM) in a way that's plenty functional and accomplishes my goals (albeit a little slower than Copilot on most tasks).
Workflow improvement: Use a repo bundler to make a single file and drop your entire codebase in gemini or chatgpt. Their whole codebase comprehension is great and you can chat for a long time without the api cost. You can even get them to comment on each other's feedback, it's great.
Unfortunately that only works with very small projects.
I can get useful info from gemini on 120k loc projects with repomix ignoring a few select files. If you're in the enterprise world obviously it's a different thing.
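A repo bundler doesn't have to be fancy; tools like repomix essentially do a filtered version of the following (rough sketch, with the ignore list and extensions as assumptions you'd tune per project):

    from pathlib import Path

    IGNORE_DIRS = {".git", "node_modules", "dist", "__pycache__"}   # assumed defaults
    EXTS = {".py", ".js", ".ts", ".md"}                             # tune per project

    def bundle(repo_root: str, out_file: str = "bundle.txt") -> None:
        with open(out_file, "w", encoding="utf-8") as out:
            for path in sorted(Path(repo_root).rglob("*")):
                if (path.is_file() and path.suffix in EXTS
                        and not IGNORE_DIRS.intersection(path.parts)):
                    out.write(f"\n===== {path} =====\n")   # header so the model keeps files apart
                    out.write(path.read_text(encoding="utf-8", errors="ignore"))

    bundle(".")   # then paste bundle.txt into Gemini/ChatGPT, mind the context limit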
I often like to just talk it out. Stream of thought. It gives it the full context of your mental model. Talk through an excalidraw diagram.
I do that as well, with Wispr Flow. But I still forget details that the questions make obvious.
This is a little anthropomorphic. The faster option is to tell it to give you the full content of an ideal context for what you’re doing and adjust or expand as necessary. Less back and forth.
It's not, though. One of the key gaps right now is that people do not provide enough direction on the tradeoffs they want to make. Generally LLMs will not ask you about them; they will just go off and build. But if you have them ask, they will often come back with important questions about things you did not specify.
This is the correct answer. I like to go one step further than the root comment:
Nearly all of my "agents" are required to ask at least three clarifying questions before they're allowed to do anything (code, write a PRD, write an email newsletter, etc)
Force it to ask one at a time and it's even better, though the gain is not as much of a step function as the one you get versus just letting it run off your initial ask.
I think the reason is exactly what you state @7thpower: it takes a lot of thinking to really provide enough context and direction to an LLM, especially (in my opinion) because they're so cheap and require no social capital cost (vs asking a colleague / employee—where if you have them work for a week just to throw away all their work it's a very non-zero cost).
My routine is:
Prompt 1: <define task> Do not write any code yet. Ask any questions you need for clarification now.
Prompt 2: <answer questions> Do not write any code yet. What additional questions do you have?
Repeat until the questions become unimportant.
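Scripted against an API instead of typed into a chat, that routine is just a loop (sketch only; ask_model is a hypothetical callable standing in for whatever SDK you use):

    def clarify_then_build(task: str, ask_model, max_rounds: int = 3) -> str:
        # ask_model: hypothetical callable taking a prompt string and returning the model's reply
        history = [task + "\nDo not write any code yet. "
                          "Ask any questions you need for clarification now."]
        for _ in range(max_rounds):
            questions = ask_model("\n\n".join(history))
            answers = input(f"Model asked:\n{questions}\n"
                            "Your answers (leave blank once the questions stop mattering): ")
            if not answers.strip():
                break
            history += [questions,
                        answers + "\nDo not write any code yet. "
                                  "What additional questions do you have?"]
        history.append("No further questions needed. Now implement the task.")
        return ask_model("\n\n".join(history))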
They don’t know what to ask. They only assemble questions according to training data.
It seems like you are trying to steer toward a different point or topic.
In the course of my work, I have found they ask valuable clarifying questions. I don’t care how they do it.
While true, the questions are all points where the LLM would have "assumed" an answer, and by asking, you get to point it in the right direction instead.
Can you give me the full content of the ideal context of what you mean here?
Certainly!
[flagged]
The questions it asks are usually domain specific and pertaining to the problem, like modeling or „where do I get this data from ideally“.
Not blaming you, it's actually genius. You're simulating what it's seen, and therefore getting the end result -- peer discussed and reviewed SO code.
Although you're being voted down probably for tone, this is a very interesting point.
Is it? By far the majority of code the LLMs are trained on is going to be from Git repositories. So the idea that Stack Overflow question-and-answer sections with buggy code are dominating the training sets seems unlikely. Perhaps I'm misunderstanding?
> Perhaps I'm misunderstanding?
The post wasn't saying that Stack Overflow Q&A sections with buggy code dominate the training sets. They're saying that, whatever amount of code comes from Git repositories, the process of generating and debugging code cannot be found in the static code that exists in GitHub repos; that process is instead encoded in the conversations on SO, GitHub issues, various forums, etc. So if you want to go from buggy code to correct code in the way the LLM was trained, you do that by simulating the back and forth found in an SO question, so that when the LLM is asked for the next step, it can rely on its training.
Thanks! Okay, I agree it's an interesting concept; I'm not sure it's actually true, but I can see why it might be. I appreciate your clarification!
I took the GP to be a complaint that you have to go through this buggy-code loop over and over because of how the LLM was trained. Maybe I read sarcasm at the end of the post when there was none.
Please keep comments in this thread non-fiction.
Lolll everything in this thread is unfalsifiable conjecture??
> If you are a heavy user, you should use pay-as-you go pricing
if you’re a heavy user you should pay for a monthly subscription for Claude Code which is significantly cheaper than API costs.
Define heavy... There's a band where the Max subscription makes the most sense. The thread here talks about $1000/month; the plan beats that. But there's a larger range beyond that where you're back to having to use the API or buy credits.
A full day of Opus 4.1 or GPT-5 high reasoning doing pair programming or guided code review across multiple issues or PRs in parallel will burn through the Max monthly limits and then stop you, or cost $1500 in top-up credits for a 15-hour day. Wait, WTF, that's $300k/year! OK, while true, that misses that it's accomplishing 6-8 workstreams in parallel, all day, with no drop in efficacy.
At enterprise procurement rates, hiring a {{specific_tech}} expert can run $240/hr or $3500/day, and that expert is (a) less knowledgeable about the 3+ year old tech the enterprise is using and (b) wants to advise instead of type.
So the question then isn't what it costs, it's what's the cost of being blocked and in turn blocking committers waiting for reviews? Similarly, what's the cost of a Max for a dev that doesn't believe in using it?
TL;DR: At the team level, for guided experts and disbelievers, API likely ends up cheaper again.
Yeah, this is a no brainer for certain use cases.
Am I alone in spending $1k+/month on tokens? It feels like the most useful dollars I've ever spent in my life. The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a year or two ago.
> The software I've been able to build on a whim over the last 6 months is beyond my wildest dreams from a year or two ago.
If you don't mind sharing, I'm really curious - what kind of things do you build and what is your skillset?
Not OP, but I know it can be really difficult to measure or communicate this to people who aren't familiar with the codebase or the problem being solved.
Other than just dumping 10M tokens of chats into a gist and saying "read through everything I said back and forth with Claude for a week."
But I think I've got the start of a useful summary format: it takes every prompt and points to the corresponding code commit produced by the AI, plus a line-diff count and a summary of the task. Check it out below.
https://github.com/sutt/agro/blob/master/docs/dev-summary-v1...
(In this case it's a Python CLI AI-coding framework that I'm using to build the package itself.)
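For a rough idea of where the line-diff numbers in such a summary can come from, `git log --numstat` already reports per-commit added/removed counts (sketch only; how commits map back to prompts is assumed here to live in the commit subject, which is just a convention, not that tool's actual format):

    import subprocess

    def commit_summary(n: int = 20) -> list[dict]:
        """Recent commits with added/removed line counts -- raw material for a dev summary."""
        log = subprocess.run(
            ["git", "log", f"-{n}", "--numstat", "--pretty=format:@@%h %s"],
            capture_output=True, text=True, check=True,
        ).stdout
        entries = []
        for block in log.split("@@")[1:]:
            lines = block.strip().splitlines()
            sha, _, subject = lines[0].partition(" ")
            added = removed = 0
            for stat in lines[1:]:
                if "\t" not in stat:               # skip blank separator lines
                    continue
                a, r, _path = stat.split("\t", 2)
                if a.isdigit() and r.isdigit():    # binary files report "-"
                    added, removed = added + int(a), removed + int(r)
            entries.append({"commit": sha, "task": subject, "added": added, "removed": removed})
        return entries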
I’m unclear how you’re hitting $1k/mo in personal usage. GitHub Copilot charges $0.04 per task with a frontier model in agent mode - and it’s considered expensive. That’s 850 coding tasks per day for $1k/mo, or around 1 per minute in a 16hr day.
I’m not sure a single human could audit & review the output of $1k/mo in tokens from frontier models at the current market rate. I’m not sure they could even audit half that.
You don't audit and review all $1k worth of tokens!
The AI might write ten versions. Versions 1-9 don't compile, but it automatically makes changes and gets further each time. Version 10 actually builds and seems to pass your test suite. That is the version you review!
—and you might not review the whole thing! 20 lines in, you realize the AI has taken a stupid approach that will obviously break, so you stop reading and tell the AI it messed up. This triggers another ~5 rounds of producing code before something compiles, which you can then review, hopefully in full this time if it did a good job.
I can easily hit the daily usage limits on Claude Code or Openai Codex by asking for more complex tasks to be done which often take relatively little time to review.
There's a lot of tokens used up quickly for those tools to query the code base, documentation, try changes, run commands, re-run commands to call tools correctly, fix errors, etc.
Audit and review? Sounds like a vibe killer.
Do people actually use GitHub copilot?
At any rate, I could easily go through that much with Opus because it's expensive and I'm often loading the context window to do discovery; this may include not only parts of a codebase but also large schemas, along with samples of inputs and outputs.
When I’m done with that, I spend a bunch of turns defining exactly what I want.
Now that MCP tools work well, there is also a ton of back and forth that happens there (this is time efficient, not cost efficient). It all adds up.
I have Claude Code Max, which helps, but one of the reasons it's so cheap is all of the truncation it does, so I use a different tool that lets me feed in exactly the parts of a codebase that I want to, which can be incredibly expensive.
This is all before the expenses associated with testing and evals.
I’m currently consulting, a lot of the code is ultimately written by me, and everything gets validated by me (if the LLM tells me about how something works, I don’t just take its word for it, I go look myself), but a lot of the work for me happens before any code is actually written.
My ability (usually clarity of mind and patience) to review an LLM's output is still a gating factor, but the costs can add up quickly.
> Do people actually use GitHub copilot?
I use it all the time. I am not into Claude Code-style agentic coding; I'm more of the "change the relevant lines and let me review" type.
I work in web dev. With VS Code I can easily select a line of code that's wrong, which I know how to fix but am honestly too tired to type out, press Ctrl+I, and tell it to fix it. I know the fix, so I can easily review it.
GPT-4.1 agent mode is unlimited in the Pro tier. It's half the cost of Claude, Gemini, and ChatGPT. The VS Code integration alone is worth it.
Now, that is not the kind of AI-does-everything coding these companies are marketing and want you to do; I treat it almost like an assistant. For me it's perfect.
That’s good to know, I haven’t tried it for a few years.
I trust Copilot way more than any agentic coder. First time I used Claude it went through my working codebase and tried to tell me it was broken in all these places it wasn't. It suggested all these wrong changes that if applied would have ruined my life. Given that first impression, it's going to take a lot to convince me agentic coding is a worthwhile tool. So I prefer Copilot because it's a much more conservative approach to adding AI to my workflow.
Care to show what you've built?
> Am I alone in spending $1k+/month on tokens?
I would if there were any positive ROI for these $12k/year, or if it were a small enough fraction of my income. For me, neither are true, so I don’t :).
Like the siblings I would be interested in having your perspective on what kind of thing you do with so many tokens.
If you're freelancing and doing 2x as much as before in the same time, it makes sense that you can earn 2x as much. But honestly, on many projects I feel like I was able to scale my output far more than 2x. It is a different story, of course, if you only have a main job. But I have been doing a main job plus freelancing on the side forever now.
I do freelancing mostly for fun though, picking projects I like, not directly for the money, but this is where I definitely see multiples of difference on what you can charge.
You're not alone in using $1k+/month in tokens. But if you are spending that much, you should definitely be on something like Anthropic's Max plan instead of going full API, since it is a fraction of the cost.
I'm starting to notice a pattern with these kinds of comments. They almost never provide any actual evidence for the work they mention.
I would personally never. Do I want to spend all my time reviewing AI code instead of writing? Not really. I also don't like having a worse mental model of the software.
What kind of software are you building that you couldn't before?
> One of the weird things I found out about agents is that they actually give up on fixing test failures and just disable tests. They’ll try once or twice and then give up.
It's important not to think in generalities like this. How they approach it depends on your test framework, and even on the language you use. If disabling tests is easy and common in that language/framework, it's more likely to do it.
For testing a CLI, I currently use a run_tests.sh, and not once has it tried to disable a test. Though that can be its own problem when it hits one it can't debug.
# run_tests.sh
# Handle multiple script arguments or default to all .sh files
scripts=("${@/#/./examples/}")
[ $# -eq 0 ] && scripts=(./examples/*.sh)
for script in "${scripts[@]}"; do
    echo -n "Running $script ..."
    bash "$script" || exit 1   # assumed invocation; the original snippet only preserved the OK echo
    echo " OK"
done
----
Another tip: for specific tasks, don't bother with "please read file x.md"; Claude Code (and others) accept the @file syntax, which puts the file into context right away.
Some of these sample prompts in this blog post are extremely verbose:
If you are considering leveraging any of the documentation or examples, you need to validate that the documentation or example actually matches what is currently in the code.
I have better luck being more concise and avoiding anthropomorphizing. Something like:
"validate documentation against existing code before implementation"
Should accomplish the same thing!
I have the best luck with RFC speak. “You MUST validate that the documentation matches existing code before implementation. You MAY update documentation to correct any mismatches.”
But I also use more casual style when investigating. “See what you think about the existing inheritance model, propose any improvements that will make it easier to maintain. I was thinking that creating a new base class for tree and flower to inherit from might make sense, but maybe that’s over complicating things”
(Expressing uncertainty seems to help avoid the model latching on to every idea with “you’re absolutely right!”)
RFC speak is a good way to put it.
Also, there's a big difference between giving general "always on" context (as in agents.md) for vibe coding - like "validate against existing code" etc - versus bouncing ideas in a chat session like your example, where you don't necessarily have a specific approach in mind and burning a few extra tokens for a one off query is no big deal.
Context isn't free (either literally or in terms of processing time) and there's definitely a balance to be found for a given task.
I've had both experiences. On some projects concise instructions seem to work better. Other times, the LLM seems to benefit from verbosity.
This is definitely a way in which working with LLMs is frustrating. I find them helpful, but I don't know that I'm getting "better" at using them. Every time I feel like I've discovered something, it seems to be situation specific.
Not my experience at all. I find that the shorter my prompts, the more garbage the results. But if I give it a lot of detail and elaborate on my thought process, it performs very well, and often one-shots the solution.
> If you are a heavy user, you should use pay-as-you go pricing; TANSTAAFL.
This is very very wrong. Anthropic's Max plan is like 10% of the cost of paying for tokens directly if you are a heavy user. And if you still hit the rate-limits, Claude Code can roll-over into you paying for tokens through API credits. Although, I have never hit the rate limits since I upgraded to the $200/month plan.
The blogpost is transparently an advertisement, which is ironic considering the author's last blogpost was https://blog.efitz.net/blog/modern-advertising-is-litter/
In this case the lunch is being paid by VC money.
I acknowledge that and get like $400 worth of tokens from my $20 Claude Code Pro subscription every month.
I'm building tools I can use when the VC money runs out or a clear winner gets on top and the prices shoot up to realistic levels.
At that point I've hopefully got enough local compute to run a local model though.
If I paid for my API usage directly instead of the plan it'd be like a second mortgage.
To be fair, allocating some tokens for planning (recursively) helps a lot. It requires more hands-on work but produces much better results. Clarifying the tasks and breaking them down is very helpful too; you just end up spending a lot of time on it. On the bright side, Qwen3 30B is quite decent, and best of all "free".
> Finally it occurred to me to put context where it was needed - directly in the test files.
Probably CLAUDE.md is a better place?
> Too much context
Claude’s Sub-agents[1] seems to be a promising way of getting around this, though I haven’t had time to play with the feature too much. Eg when you need to take a context-busting action like debugging dependencies, instead spin up a new agent to read the output and summarize. Then your top-level context doesn’t get polluted.
[1]: https://docs.anthropic.com/en/docs/claude-code/sub-agents
I've added a subagent to read the "memory_bank" files for project context after being told the task at hand, and summarize only the pertinent parts for the main agent. This is working well to keep the context focused.
https://gist.github.com/nicwolff/273d67eb1362a2b1af42e822f6c...
I wrote three sub-agents this week: one to run unit tests, another to run Playwright, and a third to write Playwright tests. These are pretty basic boundaries where it isn't hard to share context between the sub-agent and the orchestrating main agent. It seemed to help a lot. I also have complex ways to run tests (Docker, data seeding, auth), and previously it was getting lost. Only compacted a couple of times. It was a big improvement.
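The pattern in these comments boils down to "do the context-heavy read in a throwaway conversation and hand back only a summary" (sketch with a hypothetical run_subagent callable, not Claude Code's actual plumbing):

    def summarize_for_main_agent(task: str, noisy_output: str, run_subagent) -> str:
        # run_subagent: hypothetical callable that runs a prompt in a fresh, throwaway context
        prompt = (
            f"Task at hand: {task}\n\n"
            "Read the following output and return ONLY the details relevant to the task, "
            f"in under 200 words:\n\n{noisy_output}"
        )
        # The sub-agent burns its own context window on the raw material;
        # only this short summary ever enters the main agent's context.
        return run_subagent(prompt)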
As a human dev, can I humbly ask you to separate out your LLM "readme" from your human README.md? If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere. If you're putting copy in your codebase that's instructions for a bot, that isn't a README.md. By all means come up with a new convention e.g. BOTS.md for this. As a human dev I know I can safely ignore such a file unless I am working with a bot.
I think things are moving towards using AGENTS.md files: https://agents.md/ . I’d like something like this to become the consensus for most commonly used tools at some point.
There was a discussion here 3 days ago: https://news.ycombinator.com/item?id=44957443 .
> If I see a README.md in a directory I assume that means the directory is a separate module that can be split out into a separate repo or indeed storage elsewhere.
While I can understand why someone might develop that first-impression, it's never been safe to assume, especially as one starts working with larger projects or at larger organizations. It's not that unusual for essential sections of the same big project to have their own make-files, specialized utility scripts, tweaks to auto-formatter, etc.
In other cases things are together in a repo for reasons of coordination: Consider frontend/backend code which runs with different languages on different computers, with separate READMEs etc. They may share very little in terms of their build instructions, but you want corresponding changes on each end of their API to remain in lockstep.
Another example: One of my employer's projects has special GNU gettext files for translation and internationalization. These exist in a subdirectory with its own documentation and support scripts, but it absolutely needs to stay within the application that is using it for string-conversions.
While I agree READMEs should be kept for humans, README literally means "read me".
Not "this is a separate project". Not "project documentation file".
You can have READMEs dotted all over a project if that's necessary.
It's simply a file a previous developer is asking you to read before you start mucking around in that directory.
This lines up with my own experience of learning how to succeed with LLMs. What really makes them work isn't so different from what leads to success in any setting: being careful up front, measuring twice and cutting once.
I've been working very successfully with both Codex (GPT-5) and Claude Code (Opus). You develop a solution with one, then validate it with the other. I've fixed many bugs by passing the context between them, saying something like: “my other colleague suggested that…”. Bonus tip: I've started using symlinks so CLAUDE.md points at AGENTS.md; now I don't even have to maintain two different context files.
Other symlinks one can do: QWEN.md, GEMINI.md, CONVENTIONS.md (for aider).
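Setting those up is a one-time step per repo (Python sketch for illustration; a plain `ln -s AGENTS.md CLAUDE.md` does the same thing):

    from pathlib import Path

    # Point every tool-specific context file at a single AGENTS.md source of truth.
    for name in ["CLAUDE.md", "QWEN.md", "GEMINI.md", "CONVENTIONS.md"]:
        link = Path(name)
        if not link.exists():
            link.symlink_to("AGENTS.md")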
I spent much of the last several months using LLM agents to create software. I've written two blog posts about my experience; this is the second post that includes all the things I've learned along the way to get better results, or at least waste less money.
You should write more about your experience using LLMs. Was this built solely using LLMs?
Part 1: https://efitz-thoughts.blogspot.com/2025/08/my-experience-cr...
https://gist.github.com/snissn/4f06cae8fb4f4ac43ffdb104db192...
His profile says: "I'm a technology geek and do information security for a living."
The blog post starts with: "I’m not a professional developer, just a hobbyist with aspirations."
Is this a vibe blog promoting Misanthropic Claude Vibe? It is hard to tell, since all "AI" promotion blogs are unstructured and chaotic.
Hmm, to my eye those descriptors (profile and blog post intro) aren't contradictory.
[dead]
[flagged]
If you kept reading you'd realize the guy was just humble bragging.
Why?
I guess you need an active developer license to write blog posts
Or maybe this industry still trusts experienced software engineers to write well maintained and robust software used by millions that make money.
It's quite simple.
I prefer building and using software that is robust, heavily tested, and thoroughly reviewed by highly experienced software engineers who understand the code, can detect bugs, and can explain what each line of code they write does.
Today we are in a phase where embracing mediocre LLM-generated code over heavily tested and scrutinized code is encouraged in this industry, because of the hype around "vibe coding".
If you can't even begin to explain the code or point out bugs generated by LLMs, or you off-load architectural decisions to them, you're going to have a big problem explaining that in a code review or even in a professional pair-programming scenario.
Unfortunately, all of modern software depends on some random obscure dependency that is not properly reviewed https://xkcd.com/2347/
It's a funny comic, but can you actually give an example of what it's talking about? "Properly reviewed" can be construed as "has been working for a long time for a lot of people", which definitely can't be said about any AI process or any AI generated code. At the very least, 1 human person actually sat down and wrote the tools the comic is poking fun at. But with AI, we are currently producing code that was neither peer reviewed nor written (a process which includes revision) -- it was instead "generated". So it's still a step backwards.
> I prefer building and using software that is robust, heavily tested, and thoroughly reviewed by highly experienced software engineers who understand the code, can detect bugs, and can explain what each line of code they write does.
that's amazing. by that logic you probably use like one or two pieces of software max. no windows, macos or gnome for you.
LOL... I was going to say, after working in the tech industry, half the time it's a rat's nest in there.
There are excellent engineers, but there are also many not-so-great engineers, and once the sausage is made it usually isn't a pretty picture inside.
Usually only small young projects or maybe a beautiful component or two. Almost never an entire system/application.
exactly, as smart engineers end up having to work with midwits and it's not going to be pretty.
> I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones.
One vast difference - he (and all the other contributors) understood each line of code they wrote for the Linux kernel.
LLM agents doing all the work without you understanding the code they write is even far worse for a hobbyist than it is for a professional.
So having no understanding of the code that LLMs write fails the first test of inspiring confidence in building robust software.