Useful and useless (or good and “less good”) aren’t easily mapped to big and small.
From a purely UX perspective, showing a red badge suggests you’re conflating “less good” with size. Who is the target for this? Lots of useful codebases are large.
I do agree, however, that there’s value in splitting up domains into something a human can easily learn and keep in their head after, say, a few days of being deeply entrenched. Tokens could actually be a good proxy for this.
> Who is the target for this?
Agents. There are going to be more tools and software targeted for consumption by agents.
Yeah, but a large monorepo can consist of many small subprojects. And arguably this is becoming a best practice.
Just spawn the agent in one of the subprojects.
This is a really interesting metric to track. I agree with the sentiment that token budgets are becoming the new 'lines of code' metric. Even though context windows are constantly expanding (like the 200k default you used for Opus), there's still a tangible benefit to keeping a codebase lean. It's not just about fitting it into the window, but also about the signal-to-noise ratio for the agent. The color-coding based on percentage is a nice touch for a quick visual health check.
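For anyone curious how a percentage-based badge color could work, here is a minimal sketch. The 200k window comes from the default mentioned above; the function name `badge_color` and the 25% / 75% cutoffs are my own assumptions, not the project's actual thresholds.

```python
def badge_color(token_count: int, context_window: int = 200_000) -> str:
    """Map a codebase's token count to a badge color by its share of the context window."""
    if context_window <= 0:
        raise ValueError("context_window must be positive")
    share = token_count / context_window
    # Hypothetical cutoffs; the real badge may draw the lines elsewhere.
    if share < 0.25:
        return "green"
    if share < 0.75:
        return "yellow"
    return "red"
```

With these assumed cutoffs, a 10k-token repo gets a green badge and a 190k-token repo gets a red one.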
It’s interesting but I think it’s measuring the wrong thing. Abstraction is a fundamental principle in software. As a human, I’ve worked with classes and modules far larger than what fits in my head, just because I’m only fitting the function signatures and purpose into my head, and not the implementation details. In practice I find Claude really good at extracting useful information in a human-like way from a codebase. It doesn’t usually stuff the entire codebase into its context window.
It's a fun, in the "style of the time" thing to track, but within a year or two, context window limitations won't be a thing.
Doubt me?
Think back 2 years. Now compare today. Change is happening at massive speed, and this issue is near the top of the list to be resolved in some fashion.
Gemini 1.5 announced a 1 million token context window in 2024. I admire this forward-looking view of new technologies, especially given how bad people can be at predictions, as old HN posts and comments show.
If we look back 2 years, companies weren't investing so heavily in training their LLMs on code. Whatever code they got their hands on was what ended up in the LLM's training corpus. It's well known that the most recent improvements in LLM productivity came after they spent millions on different labs to produce more coding datasets for them.
So LLMs have gotten a lot better at not needing the entire codebase in context at once: their weights are already so well tuned to development environments that they can better infer and index things as needed. However, I fail to see how the context window limitation would no longer be an issue, since it's a fundamental part of the real world. Would we get better and more efficient ways of splitting and indexing context windows? Surely. Will that reduce our fear of soiling our contexts with bad prompt response cycles? Probably not...
I’m not so sure an increasingly large context window will be seen as a critical enabler (as it was viewed 6 months ago), after watching how amazingly effective subagents and tool calls are at tackling parts of the problem and surfacing just the relevant bits for the task at hand. And if increasing the context window isn’t the current bottleneck, effort will be put elsewhere.
I agree. My suspicion is that token efficiency is what will drive more efficient tool calls and tool building. And we want that. Agents should rely less on raw intelligence (the ability to hold everything in context) and more on building tools to get the job done.
Interesting, but not adding something to my CI for a badge, too paranoid.
This is an interesting concept. Thank you for sharing. I have an export.sh or export.ps1 script that takes the relevant files in my repository and puts them in a `dump.txt` file inside `docs/llm`.
I am not very good with AI though. Is there a quick and easy way to calculate the token count and add it to my dump.txt file, ideally using only simple tools included by default on Linux (bash) or on Windows (PowerShell)?
Thank you in advance.
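One rough way to do this without a real tokenizer is the common rule of thumb of about 4 characters per token for English text. The sketch below uses Python rather than pure bash or PowerShell, and the helper names `estimate_tokens` and `append_estimate` are made up for illustration. Treat the result as a ballpark: actual counts depend on the model's tokenizer and can differ noticeably for code.

```python
import math
from pathlib import Path

# Rule-of-thumb ratio for English text; an assumption, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a rough token estimate from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def append_estimate(path: Path) -> int:
    """Estimate tokens for the file at `path` and append the figure as a final line."""
    text = path.read_text(encoding="utf-8", errors="replace")
    estimate = estimate_tokens(text)
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"\n# Estimated tokens: ~{estimate}\n")
    return estimate
```

Calling `append_estimate(Path("docs/llm/dump.txt"))` at the end of the export would add the line. If you want to stay in the shell, the same estimate is roughly the byte count from `wc -c` divided by four.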
This hits a pain point we've seen repeatedly in production AI agent work. Teams often underestimate token costs until they're burning through hundreds of dollars daily on context that could be trimmed.
The real challenge isn't just size - it's that most repos include massive amounts of irrelevant context (node_modules, generated files, outdated comments). We've seen 50+ file repos become <10 token-efficient files after proper filtering.
One pattern that works: maintain parallel "agent-ready" exports of your codebase. Keep the full repo for humans, but curate what agents actually need to see. The badge is brilliant because it makes token efficiency visible - like bundle size for the LLM era.
Context windows will grow, but efficiency always matters. Even with 1M+ token models, there's a massive difference between a 10k and 100k token codebase in terms of accuracy, cost, and latency.
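A minimal sketch of the filtering step behind an "agent-ready" export, assuming a simple exclude list. The directory names, suffixes, and the `agent_ready_files` helper are illustrative assumptions; a real setup would tune them per repo.

```python
from pathlib import Path

# Hypothetical exclude lists; adjust to whatever counts as noise in your repo.
EXCLUDED_DIRS = {"node_modules", ".git", "dist", "build"}
EXCLUDED_SUFFIXES = (".lock", ".min.js", ".map")


def agent_ready_files(root: Path) -> list[Path]:
    """Return repo-relative paths of files worth showing to an agent."""
    kept = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Skip anything that lives under (or is named like) an excluded directory.
        if any(part in EXCLUDED_DIRS for part in relative.parts):
            continue
        # Skip generated or vendored artifacts by file suffix.
        if path.name.endswith(EXCLUDED_SUFFIXES):
            continue
        kept.append(relative)
    return kept
```

The returned list can then feed whatever script builds the export, so the full repo stays untouched for humans.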
Some say that the ideal size of an individual function in a codebase is related to the amount of information you can hold in working memory. Maybe the ideal size for a library is the amount you can fit in an LLM context window?
What’s the going rate for tokens in terms of dollars? How much are companies spending on “tokens”?
Also kind of ironic that small codebases are now in vogue, just when Google-style monolithic repos were so popular.
> What’s the going rate for tokens in terms of dollars?
It depends on the provider and model. Pricing is usually quoted in $/million tokens, with different per-token prices for input and output (output tends to be more expensive than input). Some models also charge more per token if the context size is above a threshold. Cached operations may also reduce the price per token.
OpenRouter has a good overview of providers and models: https://openrouter.ai/models
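To make the $/million arithmetic concrete, here is a small sketch. The function name and the example prices are hypothetical, not any provider's actual rates, and it ignores caching discounts and long-context surcharges.

```python
def request_cost_usd(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one request when input and output tokens are priced separately per million."""
    total = (
        input_tokens * input_price_per_million
        + output_tokens * output_price_per_million
    )
    return total / 1_000_000


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
cost = request_cost_usd(100_000, 2_000, 3.0, 15.0)
print(f"${cost:.2f}")
```

At those made-up prices, a request carrying 100k tokens of context and producing 2k tokens of output comes to about $0.33, and most of that is the input side.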
The math on what people are actually paying is hard to evaluate. IME, most companies would rather buy a subscription than give their developers API keys (as it makes spending predictable).
API keys with hard limits, I assume?
Are there companies out there that add token counts to ticket “costs”, i.e. are story points being replaced/augmented by token counts?
Or even worse, an exchange rate of story points to tokens used…
I'm not sure that smaller bases are always better.
Interesting concept, but is it going to age well, given that model context sizes are changing all the time (growing, mostly)?
Max context sizes are probably going to go up, but smaller contexts will always be cheaper and more efficient than larger ones.
Smart idea. Token budgets are becoming the new line count metric for the LLM era.
Nah. I can write a whole program using 0 tokens, I can’t write a whole program with 0 lines of code.