Expensively Quadratic: The LLM Agent Cost Curve

(blog.exe.dev)

62 points | by luu 3 days ago

29 comments

  • stuxf 6 hours ago

    > Some coding agents (Shelley included!) refuse to return a large tool output back to the agent after some threshold. This is a mistake: it's going to read the whole file, and it may as well do it in one call rather than five.

    I disagree with this: IMO the primary reason these limits still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase, hit way too many files, and overload the context.

    Otherwise, lots of interesting stuff in this article! Having a precise calculator was very useful for getting a sense of how much we should be putting into an agent loop to reach a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.
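
    If you want to play with the math yourself, a back-of-the-envelope version of that kind of calculator is just a loop over turns; all the prices and sizes below are made-up placeholders, not the article's numbers:

      # Rough sketch of why agent-loop cost grows quadratically with the number
      # of turns. Prices and sizes are illustrative placeholders only.
      INPUT_PRICE = 3.00 / 1_000_000    # $ per uncached input token (placeholder)
      CACHED_PRICE = 0.30 / 1_000_000   # $ per cache-read token (assumed 0.1x)
      OUTPUT_PRICE = 15.00 / 1_000_000  # $ per output token (placeholder)

      def loop_cost(turns, system_tokens=5_000, new_tokens=2_000, out_tokens=500):
          """Each turn re-reads the whole prior context, so total input ~ O(turns^2)."""
          total, context = 0.0, system_tokens
          for _ in range(turns):
              total += context * CACHED_PRICE      # prior context, hopefully cached
              total += new_tokens * INPUT_PRICE    # fresh tool output / user input
              total += out_tokens * OUTPUT_PRICE   # the model's reply
              context += new_tokens + out_tokens   # the context keeps growing
          return total

      for n in (10, 50, 100):
          print(f"{n} turns: ${loop_cost(n):.2f}")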

    • tekacs 4 hours ago

      I think that's reasonable, but then the agent should have the ability to override it on the next call, even if it requires the agent to have read the file once or something.

      In the absence of that, you end up with what several of the harnesses ended up doing, where an agent uses a million tool calls to very slowly read a file in 200-line chunks. I think they _might_ have fixed it now (or agent-fixes, my agent harness, might be fixing it), but Codex used to do this and it made it unbelievably slow.
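
      Roughly what I have in mind is a cap the agent can explicitly lift on a later call; a hypothetical sketch (the tool name, limit, and flag are all invented, not any real harness's API):

        # Hypothetical: cap tool output by default, but let the agent ask for the
        # full thing once it knows what it is reading.
        MAX_TOOL_TOKENS = 8_000

        def read_file_tool(path, allow_large=False):
            text = open(path, encoding="utf-8", errors="replace").read()
            approx_tokens = len(text) // 4  # crude chars-to-tokens estimate
            if approx_tokens > MAX_TOOL_TOKENS and not allow_large:
                return (f"[truncated: {path} is ~{approx_tokens} tokens; "
                        f"call again with allow_large=true to read it in one go]")
            return text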

  • the_harpia_io 3 hours ago

    The quadratic curve makes sense, but honestly what kills us more is the review cost: AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep. We burn more time auditing AI output than we save on writing it, and that compounds. The API costs are predictable, at least.

    • netdevphoenix 2 hours ago

      > AI generates code fast but then you're stuck reading every line because it might've missed some edge case or broken something three layers deep

      I imagine that in the future this will be tackled with a heavily test-driven approach and tight regulation of what the agent can and cannot touch. So: frequent small PRs over big ones. Limit folder access to only the folders that need changing. Let it build the project; if it doesn't build, no PR submission allowed. If a single test fails, no PR submission allowed. And the tests will likely be the first, if not the main, focus in LLM PRs.
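
      The gate itself can be dumb and mechanical, something like this as a pre-submit check (the build/test commands are placeholders for whatever the project actually uses):

        # Toy pre-submit gate: no build, no PR; one failing test, no PR.
        import subprocess
        import sys

        def run(cmd):
            print("::", " ".join(cmd))
            return subprocess.run(cmd).returncode == 0

        if not run(["make", "build"]):
            sys.exit("build failed: PR submission blocked")
        if not run(["make", "test"]):
            sys.exit("tests failed: PR submission blocked")
        print("gate passed: PR may be submitted")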

      I use the term "LLM" and not "AI" because I notice that people have started attributing LLM-related issues (like ripping off copyrighted material, excessive use of natural resources, etc.) to AI in general, which is damaging for the future of AI.

    • aurareturn 3 hours ago

      I disagree. I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library.

      Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

      Bugs that the AI agent writes, I would have also written. An example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.

      I also find that the AI writes less buggy code than I did. It handles cases that I wouldn’t have thought of. It uses best practices more often than I did.

      Maybe I was a bad dev before LLMs, but I find myself producing better-quality applications much more quickly.

      • dns_snek an hour ago

        > An example is unexpected data that doesn’t match expectations. Can’t fault the AI for those bugs.

        I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully? Expectations should be defined, input validated, and anything that's unexpected should be rejected. Resilience against poorly formatted or otherwise nonsensical input is a pretty basic requirement.
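
        And "reject anything unexpected" can be as boring as this kind of toy validation (the fields and rules here are made up, just to illustrate the bar being discussed):

          # Toy example of "define expectations, reject the rest".
          def parse_order(payload: dict) -> dict:
              if not isinstance(payload.get("quantity"), int) or payload["quantity"] <= 0:
                  raise ValueError("quantity must be a positive integer")
              if payload.get("currency") not in {"USD", "EUR"}:
                  raise ValueError("unsupported currency")
              return {"quantity": payload["quantity"], "currency": payload["currency"]}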

        I hope I severely misunderstood what you meant to say because we can't be having serious discussions about how amazing this technology is if we're silently dropping the standards to make it happen.

        • aurareturn an hour ago

          > I don't understand, how can you not fault AI for generating code that can't handle unexpected data gracefully?

          Because I, the spec writer, didn't think of it. I would have made the same mistake if I wrote the code.
        • the_harpia_io an hour ago

          Yeah, you're spot on: the whole "can't fault AI for bugs" mindset is exactly the problem. If a junior dev shipped code that crashed on malformed input, we'd send it back for proper validation; why would we accept worse from AI? I keep seeing this pattern where people lower their bar because the AI "mostly works", but then you get these silent failures or weird edge-case explosions that are way harder to debug than if you'd just written defensive code from the start. Honestly, the scariest bugs aren't the ones that blow up in your face; it's the ones that slip through and corrupt data or expose something three deploys later.

      • netdevphoenix 2 hours ago

        > Now I first discuss with an AI Agent or ChatGPT to write a thorough spec before handing it off to an agent to code it. I don’t read every line. Instead, I thoroughly test the outcome.

        This is likely the future.

        That being said: "I used to spend most of my time writing code, fixing syntax, thinking through how to structure the code, looking up documentation on how to use a library."

        If you are spending a lot of time fixing syntax, have you looked into linters? If you are spending too much time thinking about how to structure the code, how about spending a few days coming up with some general conventions, or simply using existing ones?

        If you are getting that much productivity from LLMs, it is worth checking whether you were simply unproductive relative to the average dev in the first place. If that's the case, you might want to think about what is going to happen to your productivity gains when everyone else jumps on the LLM train. LLMs might be covering for your low productivity at the code level, but you might still be dropping the ball in non-code areas. That's the higher-level pattern I would be thinking about.

        • aurareturn 2 hours ago

          I was a good dev, but I did not love the code itself. I loved the outcome. Other devs would have done better on Leetcode and would have produced cleaner syntax than me.

          I’ve always been more of a product/business person who saw code as a way to get to the end goal.

          That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.

          Hence, LLMs have been far better for me in terms of productivity.

          • skydhash 2 hours ago

            > I’ve always been more of a product/business person who saw code as a way to get to the end goal.

            That’s what code always is: a description of how the computer can help someone get to the end goal faster. Devs care a little more about the description because end goals change, and rewriting the whole thing from scratch is costly and time-consuming.

            > That elite coder who hates talking to business people and who cares more about the code than the business? Not me. I’m the opposite.

            I believe that coder exists only in your imagination. All the good ones I know are great communicators. Clarity of thought is essential to writing good code.

            • aurareturn an hour ago

              > I believe that coder exists only in your imagination. All the good ones I know are great communicators. Clarity of thought is essential to writing good code.

              I don't think so. These coders exist everywhere. Plenty of great coders are great at writing the code itself but not at the business aspects. Many simply do not care about the business or customer side; to them, the act of coding, producing quality code, and the process of writing software is the goal. I.e., these are the people most likely to decline building a feature that customers and the business desperately need because it might make the code base harder to maintain, and the ones who would rather refactor than build new features. In the past, these people had plenty of value. In the era of LLMs, I think they have less value than business/product-oriented devs.
      • adrianN 3 hours ago

        You have way more trust in test suites than I do. How complex is the code you’re working with? In my line of work, most serious bugs surface in complex interactions between different subsystems that are really hard to catch in a test suite. Additionally, in my experience, the bugs AI produces are completely alien: you can have perfect code for large functions and then, somewhere in the middle, absolutely nonsensical mistakes. Reviewing AI code is really hard because you can’t use your normal intuitions and really have to check everything meticulously.

        • aurareturn 2 hours ago

          If they’re hard to catch with a comprehensive suite of tests, what makes you think you can catch them by hand-coding?

          • skydhash an hour ago

            A great deal of thinking about the code, which you can only do if you’re very familiar with it. Writing the code is trivial. I spend nearly all my work hours thinking about edge cases.

            • aurareturn an hour ago

              Even if I vibe coded an app without looking at every line, I'm still very familiar with how the app works and how it should work. It's just a different level of abstraction. Vibe coding doesn't mean I'm not thinking about edge cases; in fact, I might think about them more, since I have more time to think about them without having to write the code.

  • alexhans 4 hours ago

    Nice article. I think a key part of the conversation is getting people to start thinking in terms of evals [1] and observability, but it's been quite tough to combat the hype of "but X magic product just solves what you mentioned as a concern for you".

    You'd think cost is an easy talking point to help people care, but people's starting points are so heterogeneous that it's tough to show them they can take control of this measurement themselves.

    I say the latter because the article is a snapshot in time, and without recurrent observation of this, some aspects may change radically depending on the black-box implementations of the integrations they depend on (or even the pricing strategies).

    [1] https://ai-evals.io/

  • 0-_-0 an hour ago

    The cache gets read at every token generated, not at every turn of the conversation.

    • mzl an hour ago

      Depends on which cache you mean. The KV cache gets read for every token generated, but the prompt cache (which is what incurs the cache-read cost) is read when the conversation starts.

      • 0-_-0 an hour ago

        What's in the prompt cache?

        • bsenftner an hour ago

          Way too much. This has got to be the most expensive, least common-sense way to make software ever devised.

  • seyz 2 hours ago

    128k tokens sounds great until you see the bill

  • TZubiri 5 hours ago

    I'm not sure, but I think cached-read costs are not the most accurately priced. If you consider your costs to be what you pay when consuming an API endpoint, then sure, the answer will be 50k tokens. But if you consider how much it costs the provider, cached tokens probably have a way higher margin than the (probably negative) margin on input and output inference tokens.

    Most caching is done without hints from the application at this point, but I think some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will go down. In essence, you don't really reprocess the input tokens at inference time. If you own the hardware, it's quite trivial to infer one output token at a time; there's no additional cost. If you have 50k input tokens and you generate one output token, it's not like you have to "re-infer" the 50k input tokens before you output the second token.

    To put it in simple terms, the time it takes to generate the Millionth output token is the same as the first output token.

    This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example, by implementing a custom logit_bias mechanism client-side), so I infer one output token at a time. This is not quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
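
    For the self-hosted case, the mechanics look roughly like this minimal sketch with transformers; the model, the bias values, and the greedy pick are arbitrary placeholders:

      # One output token per step, with a client-side bias applied to the logits
      # before choosing. Reusing past_key_values is the "free cached input" part.
      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      tok = AutoTokenizer.from_pretrained("gpt2")
      model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

      ids = tok("The capital of France is", return_tensors="pt").input_ids
      past = None
      bias = {tok.encode(" Paris")[0]: -5.0}  # e.g. steer away from one token

      with torch.no_grad():
          for _ in range(5):
              out = model(ids if past is None else ids[:, -1:],
                          past_key_values=past, use_cache=True)
              past = out.past_key_values          # earlier tokens are not re-inferred
              logits = out.logits[0, -1]
              for token_id, b in bias.items():    # client-side logit_bias
                  logits[token_id] += b
              next_id = torch.argmax(logits).view(1, 1)
              ids = torch.cat([ids, next_id], dim=-1)

      print(tok.decode(ids[0]))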

    So, bottom line: cached input tokens are naturally almost free (unless you hold them for a long period of time); the price of cached-input APIs is probably due to the lack of API negotiation over which inputs you want to cache. As APIs and self-hosted solutions evolve, we will likely see the cost of cached inputs drop massively, to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts. Your output tokens shouldn't be charged again as inputs, at least not more than once.

    • NitpickLawyer 4 hours ago

      While some efficiencies could be gained from better client-server negotiation, the cost will never be 0. It isn't 0 even in "lab conditions", so it can't be 0 at scale. There are a few misconceptions in your post.

      > the time it takes to generate the Millionth output token is the same as the first output token.

      This is not true, even if you have the KV cache "hot" in VRAM. That's just not how transformers work.

      > cached input tokens are naturally almost free

      No, they are not in practice. There are pure engineering considerations here: how you route, when you evict KV cache, where you evict it to (RAM/NVMe), how long you keep it, etc. At the scale of oAI/goog/anthropic these are not easy tasks, and the cost is definitely not 0.

      Think about a normal session. A user might prompt something, wait for the result, re-prompt (you hit "hot" cache), and then go for a coffee. They come back 5 minutes later. You can't keep that in "hot" cache. Now you have to route the next message in that thread to a) a place where you have free "slots"; b) a place that can load the KV cache from "cold" storage; and c) a place that has enough "room" to handle a possible max-context request. These are not easy things to do in practice, at scale.

      Now consider 100k users doing basically this, all day long. This is not free and can't become free.

    • 2001zhaozhao 4 hours ago

      Caching might be free, but I think making caching cost nothing at the API level is not a great idea either, considering that LLM attention is currently more expensive with more tokens in context.

      Making caching free would price "100000 token cache, 1000 read, 1000 write" the same as "0 token cache, 1000 read, 1000 write", whereas the first one might cost more compute to run. I might be wrong about the scale of the effect here, though.
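
      To make that concrete with placeholder prices (reading "read/write" as fresh input and output tokens; none of these numbers are real provider rates):

        input_p, cache_p, output_p = 3.0e-6, 0.3e-6, 15.0e-6  # $/token, made up

        def turn_cost(cached, fresh_in, out, free_cache=False):
            # The cache-read charge disappears entirely if cached context were free.
            return (0 if free_cache else cached * cache_p) + fresh_in * input_p + out * output_p

        print(turn_cost(100_000, 1_000, 1_000))                   # 0.048: big cached context
        print(turn_cost(0, 1_000, 1_000))                         # 0.018: no cached context
        print(turn_cost(100_000, 1_000, 1_000, free_cache=True))  # 0.018: priced like no context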

    • mike_hearn 3 hours ago

      GPU VRAM has an opportunity cost, so caching is never free. If that RAM is being used to hold KV caches in the hope that they'll be useful in future, but you lose that bet and you never hit that cache, you lost money that could have been used for other purposes.

    • eshaham78 5 hours ago

      This matches my experience running coding agents at scale. The cached-token pricing is indeed somewhat artificial: in practice, for agent workflows with repeated context (like reading the same codebase across multiple tasks), you can achieve near-zero input costs through strategic caching. The real cost optimization isn't just token pricing but minimizing the total tokens flowing through the loop via better tool design.

      • 2001zhaozhao 4 hours ago

        Are you hosting your own infrastructure for coding agents? At first glance, sharing actual codebase context across compacts / multiple tasks seems pretty hard to pull off with a good cost-benefit ratio unless you have vertical integration from the inference all the way to the coding-agent harness.

        I'm saying this because current external LLM providers like OpenAI tend to charge quite a bit for longer-term caching, plus the 0.1x cache-read cost multiplied by the number of LLM calls. So I doubt context sharing would actually be that beneficial: you won't need all the repeated context every time, so caching it means a longer context for each agentic task, which might increase API costs by more overall than you save by caching.

  • jauntywundrkind 6 hours ago

    Very awesome to see these numbers and to see this explored so thoroughly. Nice job, exe.dev.