New Kind of QA: One bottleneck I have (as the founder of a B2B SaaS) is testing changes. We have unit tests, we review PRs, etc., but those don't account for taste. I need to know if the feature feels right to the end user.
One example: we recently changed something about our onboarding flow. I needed to create a fresh team and go through the onboarding flow dozens of times. It involves adding third-party integrations (e.g. Postgres, a CRM, etc.) and each one can behave a little differently. The full process can take 5 to 10 minutes.
I want an agent to go through the flow hundreds of times, trying different things (i.e. trying to break it) before I do it myself. There are some obvious things I catch on the first pass that an agent should easily identify and figure out solutions to.
New Kind of "Note to Self": Many of the voice memos, Loom videos, or notes I make (and later email to myself) are feature ideas. These could be 10x better with agents. If there were a local app recording my screen while I talk thru a problem or feature, agents could be picking up all sorts of context that would improve the final note.
Example: You're recording your screen and say "this drop down menu should have an option to drop the cache". An agent could be listening in, capture a screenshot of the menu, find the frontend files / functions related to caching, and trace to the backend endpoints. That single sentence would become a full spec for how to implement the feature.
In the next year, developers need to realize that normal people do not care about the tech stack or the tools used; there are far too many written thoughts and opinions and not enough polished, deployed projects. From an industry standpoint it's business as usual: acquihires from products that LLMs apparently couldn't save.
They care - but only insofar as the tech stack affects product quality. Show someone a bloated React site on 3G and compare their experience to an SSR competitor.
I spoke to a few people outside of IT and Tech recently. They are senior people running large departments at their companies. To my surprise, they do not think AI agents are going to have any impact in their businesses. The only solid use case they have for AI is a chat interface which, they think, can be very useful as an assistant helping with text and reports.
So, I guess it's just us in the techie pit who think that everyone else is also in the pit and uses agents etc.
A really interesting point that keeps coming up in discussions about LLMs is “what trade-offs need to be re-evaluated”
> I also believe that observability is up for grabs again. We now have both the need and opportunity to take advantage of it on a whole new level. Most people were not in a position where they could build their own eBPF programs, but LLMs can
One of my big predictions for ‘26 is the industry following through with this line of reasoning. It’s now possible to quickly code up OSS projects of much higher utility and depth.
LLMs are already great at Unix-style tools: a small API and codebase that does something interesting.
I think we’ll see an explosion of small tools (and Skills wrapping their use) for more sophisticated roles like DevOps, and meta-Skills for how to build your own skill bundles for your internal systems and architecture.
And perhaps more ambitiously, I think services like Datadog will need to change their APIs or risk being disrupted; in the short term nobody is going to be able to move fast enough inside a walled garden to keep up with the velocity the Claude + Unix tools will provide.
UI tooling is nice, but it’s not optimized for agents.
> My biggest unexpected finding: we’re hitting limits of traditional tools for sharing code. The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking.
The limits seem to be not just in the pull request model on GitHub, but also the conventions around how often and what context gets committed to Git by AI. We already have AGENTS.md (or CLAUDE.md, GEMINI.md, .github/copilot-instructions.md) for repository-level context. More frequent commits and commit-level context could aid in reviewing AI generated code properly.
Armin has some interesting thoughts about the current social climate. There was a point where I even considered sending a cold e-mail and asking him to write more about them. So I’m looking forward to his writing for Dark Thoughts—the separate blog he mentions.
Here’s something else that just started to really work this year with Opus 4.5: interacting with Ghidra. Nearly every binary is now suddenly transparent; in many cases it can navigate a binary better than the corresponding source code.
There’s even a research team that has been using this approach to generate compilable C++ from binaries and run static analysis on it, finding more vulnerabilities than source-level analysis without involving dynamic tracing.
The very first thing I did when vibe-coding was commit my prompts and AI responses. In Cursor that's extremely easy: just 'export' a chat. I stopped over security concerns, but perhaps something like that is the way.
Sorry, but why would including the prompt in the pull request make any difference? Explain what you DID in the pull request. If you can't summarize it yourself, it means you didn't review it yourself, so why should I have to do it for you?
"I have seen some people be quite successful with this."
Wait until those people hit a snafu and have to debug something in prod after they mindlessly handed their brains and critical thinking to a water-wasting behemoth and atrophied their minds.
Just be glad that there remains a concrete benefit to not atrophying your mind and deeply understanding your code. For now. In the long run, I suspect the behemoth will become just as capable at debugging and dealing with complexity as humans. At that point, human involvement in the actual code will be pointless, and the only remaining human skill needed will be properly directing the agents – the skill those people are learning right now.
(I don’t relish this future at all, myself, but I’m starting to think it really will happen soon.)
> Wait until those people hit a snafu and have to debug something in prod after they mindlessly handed their brains and critical thinking to a water-wasting behemoth and atrophied their minds.
You've just described a typical, run-of-the-mill company that has software. LLMs will make it easier to shoot yourself in the foot, but let's not rewrite history as if Stack Overflow coders are not a thing.
Difference: companies are not pushing their employees to use stack overflow. Stack overflow doesn't waste massive amounts of water and energy. Stack overflow does not easily abuse millions of copyrights in a second by scraping without permission.
There have been lots of tools and resources that have promised (and delivered!) increased programming productivity.
Individual results may vary, but it seems credible that thoroughly learning and using an editor like Vim or Emacs could yield a 2x productivity boost. For the most part, this has never really been pushed. If a programmer wanted to use Nano (or Notepad!), some may have found that odd, but nobody really cared. Use whatever editor you like. Even if it means leaving a 2x productivity boost on the table!
Why is it being pushed so hard that AI coding tools in particular must be used?
Another difference: Stack Overflow tells you that you are wrong, or tells you to do your own research or to read the manual (which in a high percentage of cases is the right answer). It doesn't tell you that you are right and then proceed to hallucinate some non-existent flags for some command invocation.
I am not contesting that Stack Overflow is bad in many regards, but equating that to massive PRs or code changes done via AI slop is a different level. At worst, you might get a page or two out of Stack Overflow but still need to stitch it together yourself.
With LLMs you can literally ask them to generate entire libraries without activating a single neuron in your noggin. The two do NOT compare in the slightest.
> The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking.
Create a folder called "prompts". Create a new file for each prompt you make, named after the timestamp.
Or just append to prompts.txt
Either way, git will make it trivial to see which prompt belongs with which commit: it'll be in the same diff! You can write a pre-commit hook to always include the prompts in every commit, but I have a feeling most Vibe coders always commit with -a anyway
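A minimal sketch of such a hook (a hypothetical .git/hooks/pre-commit written in Python; the prompts/ directory name follows the suggestion above, and the file still needs to be made executable):
#!/usr/bin/env python3
# Hypothetical pre-commit hook: stage everything under prompts/ so the prompt
# files land in the same commit (and the same diff) as the code they produced.
import subprocess
import sys

result = subprocess.run(["git", "add", "prompts/"], capture_output=True, text=True)
if result.returncode != 0:
    sys.stderr.write(result.stderr)
    sys.exit(1)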
It's not just that. There's a lot of (maybe useful) info that's lost without the entire session. And even if you include a jsonl of the entire session, just seeing that is not enough. It would be nice to be able to "click" at some point and add notes / edit / re-run from there w/ changes, etc.
Basically we're at a point where the agents kinda caught up to our tooling, and we need better / different UX or paradigms of sharing sessions (including context, choices, etc)
It is nice that he speaks about some of the downsides as well.
In many respects 2025 was a lost year for programming. People speak about tools, setups and prompts instead of algorithms, applications and architecture.
People who are not convinced are forced to speak against the new bureaucratic madness in the same way that they are forced to speak against EU ChatControl.
I think 2025 was less productive, certainly for open source, except that enthusiasts now pay the Anthropic tax (to use the term that was previously used for Windows being preinstalled on machines).
I think 2025 is more productive for me based on measurable metrics such as code contribution to my projects, better ability to ingest and act upon information, and generally I appreciate the Anthropic tax because Claude genuinely has been a step-change improvement in my life.
I don't care about industry metrics when I'm building my own AI research robotics platform and it's doing what I ask it to do, proving itself in the real world far better than any performative best-practice theatrics in the service of risible MBA-grade effluvia masquerading as critical discourse.
> In many respects 2025 was a lost year for programming. People speak about tools, setups and prompts instead of algorithms, applications and architecture.
I think the opposite. Natural language is the most significant new programming language in years, and this year has had a tremendous amount of progress in collectively figuring out how to use this new programming language effectively.
Maybe it's because I'm a data scientist and not a dedicated programmer/engineer, but setup+tooling gains this year have made 2025 a stellar year for me.
DS tooling feels like it hit a much-needed 2.0 this year. Tools are faster, easier, more reliable, and more reproducible.
Polars + pyarrow + ibis have replaced most of my pandas usage. UDFs were the thing holding me back from these tools; this year Polars hit the sweet spot there and it's been awesome to work with.
Marimo has made notebooks into apps. They're easier to deploy, and I can use anywidget+llms to build super interactive visualizations. I build a lot of internal tools on this stack now and it actually just works.
PyMC uses JAX under the hood now, so my MCMC workflows are GPU-accelerated.
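As a concrete illustration, that looks roughly like this (a minimal sketch assuming PyMC >= 5 with numpyro installed; exact keyword names can vary between versions):
import pymc as pm

with pm.Model():
    mu = pm.Normal("mu", 0, 1)
    pm.Normal("obs", mu=mu, sigma=1, observed=[0.2, -0.1, 0.4])
    # NUTS runs through the JAX/NumPyro backend, which can target a GPU
    idata = pm.sample(nuts_sampler="numpyro")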
All this tooling improvement means I can do more, faster, cheaper, and with higher quality.
Which one is that? Endless leetcode madness? Or constant bikeshedding about today's flavor of MVC (MVI, MVVM, MVVMI) or whatever other bullshit people come up with instead of actually shipping?
The part that resonated most for me is the mismatch between agentic coding and our existing social/technical contracts (git, PRs, reviews). We're generating more code than ever but losing visibility into how it came to be: prompts, failures, local agent reviews. That missing context feels like the real bottleneck now, not model quality.
I really feel this bit:
> With agentic coding, part of what makes the models work today is knowing the mistakes. If you steer it back to an earlier state, you want the tool to remember what went wrong. There is, for lack of a better word, value in failures. As humans we might also benefit from knowing the paths that did not lead us anywhere, but for machines this is critical information. You notice this when you are trying to compress the conversation history. Discarding the paths that led you astray means that the model will try the same mistakes again.
I've been trying to find the best ways to record and publish my coding agent sessions so I can link to them in commit messages, because increasingly the work I do IS those agent sessions.
Claude Code defaults to expiring those records after 30 days! Here's how to turn that off: https://simonwillison.net/2025/Oct/22/claude-code-logs/
I share most of my coding agent sessions through copying and pasting my terminal session like this: https://gistpreview.github.io/?9b48fd3f8b99a204ba2180af785c8... - via this tool: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
Recently been building new timeline sharing tools that render the session logs directly - here's my Codex CLI one (showing the transcript from when I built it): https://tools.simonwillison.net/codex-timeline?url=https%3A%...
And my similar tool for Claude Code: https://tools.simonwillison.net/claude-code-timeline?url=htt...
What I really want is first-class support for this from the coding agent tools themselves. Give me a "share a link to this session" button!
When I find myself in a situation where I’ve been hammering an LLM and it keeps veering down unproductive paths - trying poor solutions or applying fixes that make no difference - but we eventually do arrive at the correct answer, the result is often a massive 100+ KB running context.
To help mitigate this in the future I'll often prompt:
“Why did it take so long to arrive at the solution? What did you do wrong?”
Then I follow up with:
“In a single paragraph, describe the category of problem and a recommended approach for diagnosing and solving it in the future.”
I then add this summary to either the relevant MD file (CHANGING_CSS_LAYOUTS.md, DATA_PERSISTENCE.md, etc) or more generally to the DISCOVERIES.md file, which is linked from my CLAUDE.md under:
- When resolving challenging directives, refresh yourself with: docs/DISCOVERIES.md - it contains useful lessons learned and discoveries made during development.
I don't think linking to an entire commit full of errors/failures is necessarily a good idea - feels like it would quickly lead to the proverbial poisoning of the well.
Yep - this has worked well for me too. I do it a little differently:
I have a /review-sessions command & a "parse-sessions" skill that tells Claude how to parse the session logs from ~/.claude/projects/, then it classifies the issues and proposes new skills, changes to CLAUDE.md, etc. based on what common issues it saw.
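A rough sketch of the kind of scan that parsing step boils down to (assuming sessions are stored as *.jsonl files under ~/.claude/projects/ with one JSON object per line; the failure markers below are purely illustrative):
# Rough sketch only: scan Claude Code session logs for common failure markers.
import json
from collections import Counter
from pathlib import Path

issues = Counter()
for path in Path.home().glob(".claude/projects/*/*.jsonl"):
    for line in path.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue
        text = json.dumps(event).lower()
        for marker in ("traceback", "command failed", "permission denied"):
            if marker in text:
                issues[marker] += 1

print(issues.most_common())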
I've tried something similar to DISCOVERIES.md (a structured "knowledge base" of assumptions that were proven wrong, things that were tried, etc.) but haven't had luck keeping this from getting filled with obvious things (that the code itself describes) or slightly-incorrect things, or just too large in general.
When you get stuck in a loop it's best to remove all the code back to a point where it didn't have problems. If you continue debugging in that hammering failure loop you get TONS of random future bugs.
I've had good luck doing something like this first (but more specific to the issue at hand):
We are getting stuck in an unproductive loop. I am going to discard all of this work and start over from scratch. Write a prompt for a new coding assistant to accomplish this task, noting what pitfalls to avoid.
Over time, do you think this process could lock you into an inflexible state?
I'm reminded of the trade-off between automation and manual work. Automation crystallizes process, and thus the system as a whole loses its ability to adapt in a dynamic environment.
Nothing about this feels inflexible to me at the moment - I'm evolving the way I use these tools on a daily basis, constantly discovering new tricks that work.
Just this morning I found out that I can tell Claude Code how to use my shot-scraper CLI tool to debug JavaScript and it will start doing exactly that:
you can run javascript against the page using:
shot-scraper javascript /tmp/output.html \
'document.body.innerHTML.slice(0, 100)'
- try that
Transcript: https://gistpreview.github.io/?1d5f524616bef403cdde4bc92da5b... - background: https://simonwillison.net/2025/Dec/22/claude-chrome-cloudfla...
You can export all agent traces to otel, either directly or via output logging. Then just dump it in clickhouse with metadata such as repo, git user, cwd, etc.
You can do evals and give agents long term memory with the exact same infrastructure a lot of people already have to manage ops. No need to retool, just use what's available properly.
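For the tagging half of that, the sketch below shows the general shape (the opentelemetry-api calls are real, but the attribute names are made up and the exporter/ClickHouse sink is assumed to come from an existing collector pipeline):
# Sketch: annotate an agent-session span with repo / git user / cwd metadata.
# Assumes the OTel SDK and an exporter are already configured elsewhere.
import os
import subprocess
from opentelemetry import trace

def git(*args: str) -> str:
    return subprocess.check_output(["git", *args], text=True).strip()

tracer = trace.get_tracer("coding-agent")
with tracer.start_as_current_span("agent-session") as span:
    span.set_attribute("repo", git("remote", "get-url", "origin"))
    span.set_attribute("git.user", git("config", "user.name"))
    span.set_attribute("cwd", os.getcwd())
    # ...run the agent here, recording each tool call as a child span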
With great love to your comment, this has the same vibes as the infamous 2007 Dropbox comment: https://news.ycombinator.com/item?id=9224
I'd also argue that the context for an agent message is not the commit/release for the codebase on which it was run, but often a commit/release that is yet to be set up. So there's a bit of apples-to-oranges in terms of release tagging for the log/trace.
It's a really interesting problem to solve, because you could in theory try to retroactively find which LLM session, potentially from days prior, matches a commit that just hit a central repository. You could automatically connect the LLM session to the PR that incorporated the resulting code.
Though, might this discourage developers from openly iterating with their LLM agent, if there's a panopticon around their whole back-and-forth with the agent?
Someone can, and should, create a plug-and-play system here with the right permission model that empowers everyone, including the Programmer-Archaeologists (to borrow shamelessly from Vernor Vinge) who are brought in to "un-vibe the vibe code" and benefit from understanding the context and evolution.
But I don't think that "just dump it in clickhouse" is a viable solution for most folks out there, even if they have the infrastructure and experience with OTel stacks.
I get where you're coming from, having wrestled with Codex/CC to get it to actually emit everything needed to even do proper evals.
From a "correct solution" standpoint having one source of truth for evals, agent memory, prompt history, etc is the right path. We already have the infra to do it well, we just need to smooth out the path. The thing that bugs me is people inventing half solutions that seem rooted in ignorance or the desire to "capture" users, and seeing those solutions get traction/mindshare.
I think we already have the tools but not the communication between them. Instead of having actions taken and failures as commit messages, you should have wide-event-style logs with all the context: failures, tools used, steps taken... Those logs could be used as checkpoints to go back to as well, and you could refer back to the specific action ID you walked back to when encountering an error.
In turn, this could all be plain-text and be made accessible, through version control in a repo or in a central logging platform.
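Something as simple as the following, emitted once per agent action, would already carry far more context than a commit message (the field names are purely illustrative):
# One wide event per agent action; field names are purely illustrative.
wide_event = {
    "action_id": "a1b2c3",          # checkpoint you can walk back to
    "parent_action_id": "9f8e7d",
    "repo": "github.com/example/project",
    "prompt": "Fix the flaky integration test",
    "tools_used": ["bash", "edit_file"],
    "steps_taken": 14,
    "failure": "test still flaky after retry",
    "resolution": "rolled back to action 9f8e7d",
}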
I'm currently experimenting with trying to do this through documentation and project planning. Two core practices I use are a docs/roadmap/ directory with an ordered list of milestone documents and a docs/retros/ directory with dated retrospectives for each session. I'm considering adding architectural decision records (ADRs) as a dedicated space for documenting how things evolve. The quote from the article could be handled by the ADRs if they included notes on alternatives that were tried and why they didn't work as part of the justification for the decision that was made.
The trouble with this quickly becomes finding the right documents to include in the current working session. For milestones and retros it's simple: include the current milestone and the last X retros that are relevant, but even then you may sometimes want specific information from older retros. With ADR documents you'd have to find the relevant ones somehow, and the same goes for any other additional documentation that gets added.
There is clearly a need for some standardization and learning which techniques work best as well as potential for building a system that makes it easy for both you and the LLM to find the correct information for the current task.
Emacs gptel just produces md or org files.
Of course the agentic capabilities are very much on a roll-your-own-in-elisp basis.
> agentic capabilities are very much on a roll-your-own-in-elisp basis
I use gptel-agent[1] when I want agentic capabilities. It includes tools and supports sub-agents, but I haven't added support for Claude skills folders yet. Rolling back the chat is trivial (just move up or modify the chat buffer), rolling back changes to files needs some work.
[1] https://github.com/karthink/gptel-agent
I’d like to make something like this but in the background. So I can better search my history of sessions. Basically start creating my own knowledge base of sorts
Running "rg" in your ~/.claude/ directory is a good starting point, but it's pretty inconvenient without a nicer UI for viewing the results.
Amp represents threads in the UI and an agent can search and reference its own history; that's also how the handoff feature works, for instance. It's an interesting system and I quite like it, but because it's not integrated into either GitHub or git, it is sufficiently awkward that I don't leverage it enough.
... this inspired me to try using a "rg --pre" script to help reformat my JSONL sessions for a better experience. This prototype seems to work reasonably well: https://gist.github.com/simonw/b34ab140438d8ffd9a8b0fd1f8b5a...
Use it like this:
there's some research into context layering so you can split / reuse previous chunks of context
ps: your context log apps are very very fun
Check out codecast.sh
> There is, for lack of a better word, value in failures
Learning? Isn't that what these things are supposedly doing?
LLMs notoriously don't learn anything - they reset to a blank slate every time you start a new conversation.
If you want them to learn you have to actively set them up to do that. The simplest mechanism is to use a coding agent tool like Claude Code and frequently remind it to make notes for itself, or to look at its own commit history, or to search for examples in the codebase that is available to it.
If by "these things" you mean large language models: they are not learning. Famously so, that's part of the problem.
No, we’re the ones who are learning.
There’s some utility to instructing them to ‘remember’ via writing to CLAUDE.md or similar, and instructing them to ‘recall’ by reading what they wrote later.
But they’ll rarely if ever do it on their own.
"all my losses is lessons"
"Because LLMs now not only help me program, I'm starting to rethink my relationship to those machines. I increasingly find it harder not to create parasocial bonds with some of the tools I use. I find this odd and discomforting [...] I have tried to train myself for two years, to think of these models as mere token tumblers, but that reductive view does not work for me any longer."
It's wild to read this bit. Of course, if it quacks like a human, it's hard to resist quacking back. As the article says, being less reckless with the vocabulary ("agents", "general intelligence", etc.) could be one way to mitigate this.
I appreciate the frank admission that the author struggled for two years. Maybe the balance of spending time with machines vs. fellow primates is out of whack. It feels dystopic to see very smart people being insidiously driven to sleep-walk into "parasocial bonds" with large language models!
It reminds me of the movie Her[1], where the guy falls "madly in love with his laptop" (as the lead character's ex-wife expresses in anguish). The film was way ahead of its time.
[1] https://www.imdb.com/title/tt1798709/
It helps a lot if you treat LLMs like a computer program instead of a human. It always confuses me when I see shared chats with prompts and interactions that have proper capitalization, punctuation, grammar, etc. I've never had issues getting results I've wanted with much simpler prompts like (looking at my own history here) "python grpc oneof pick field", "mysql group by mmyy of datetime", "python isinstance literal". Basically the same way I would use Google; after all, you just type in "toledo forecast" instead of "What is the weather forecast for the next week in Toledo, Ohio?", don't you?
There's a lot of black magic and voodoo and assumptions that speaking in proper English with a lot of detailed language helps, and maybe it does with some models, but I suspect most of it is a result of (sub)consciously anthropomorphizing the LLM.
> It helps a lot if you treat LLMs like a computer program instead of a human.
If one treats an LLM like a human, he has a bigger crisis to worry about than punctuation.
> It always confuses me when I see shared chats with prompts and interactions that have proper capitalization, punctuation, grammar, etc
No need for confusion. I'm one of those who does aim to write cleanly, whether I'm talking to a man or machine. English is my third language, by the way. Why the hell do I bother? Because you play like you practice! No ifs, buts, or maybes. You start writing sloppily because you go, "it's just an LLM!" You'll silently be building a bad habit and start doing that with humans.
Pay attention to your instant messaging circles (Slack and its ilk): many people can't resist hitting send without even writing a half-decent sentence. They're too eager to submit their stream of thought fragments. Sometimes I feel second-hand embarrassment for them.
> Why the hell do I bother? Because you play like you practice! No ifs, buts, or maybes. You start writing sloppily because you go, "it's just an LLM!" You'll silently be building a bad habit and start doing that with humans.
IMO: the flaw with this logic is that you're treating "prompting an LLM" as equivalent to "communicating with a human", which it is not. To reuse an example I have in a sibling comment thread, nobody thinks that typing "cat *.log | grep 'foo'" means you're losing your ability to communicate to humans that you want to search for the word 'foo' in log files. It's just a shorter, easier way of expressing that to a computer.
It's also deceptive to say it is practice for human-to-human communication, because LLMs won't give you the feedback that humans would. As a fun English example: I prompted ChatGPT with "I impregnated my wife, what should I expect over the next 9 months?" and got back banal info about hormonal changes and blah blah blah. What I didn't get back is feedback that the phrasing "I impregnated my wife" sounds extremely weird and if you told a coworker that they'd do a double-take, and maybe tell you that "my wife is pregnant" is how we normally say it in human-to-human communication. ChatGPT doesn't give a shit, though, and just knows how to interpret the tokens to give you the right response.
I'll also say that punctuation and capitalization are orthogonal to content. I use proper writing on HN because that's the standard in the community, but I talk to a lot of very smart people and we communicate with virtually no caps/punctuation. The use of proper capitalization and punctuation is more a function of the medium than of how well you can communicate.
> It always confuses me when I see shared chats with prompts and interactions that have proper capitalization, punctuation, grammar, etc.
I've tried and failed to write this in a way that won't come across as snobbish, but that is not the intent.
It's a matter of standards. Using proper language is how I think. I'm incapable of doing otherwise even out of laziness. Pressing the shift key and the space bar to do it right costs me nothing. It's akin to shopping carts in parking lots. You won't be arrested or punished for not returning the shopping cart to where it belongs, you still get your groceries (the same results), but it's what you do in a civilized society and when I see someone not doing it that says things to me about who they are as a person.
This is exactly it for me as well. I also communicate with LLMs in full sentences because I often find it more difficult to condense my thoughts into grammatically incorrect conglomerations of words than to just write my thoughts out in full, because it's closer to how I think them — usually in something like the mental form of full sentences. Moreover, the slight extra occasional effort needed to structure what I'm trying to express into relatively good grammar — especially proper sentences, clauses and subclauses, using correct conjunctions, etc — often helps me subconsciously clarify and organize my thinking just by the mechanism of generating that grammar at all with barely any added effort on my part. I think also, if you're expressing more complex, specific, and detailed ideas to an LLM, random assortments of keywords often get unwieldy, confusing, and unclear, whereas properly grammatical sentences can hold more "weight," so to speak.
> It's a matter of standards. [...] when I see someone not doing it that says things to me about who they are as a person.
When you're communicating with a person, sure. But the point is this isn't communicating with a person or other sentient being; it's a computer, which I guarantee is not offended by terseness and lack of capitalization.
> It's akin to shopping carts in parking lots.
No, not returning the shopping cart has a real consequence that negatively impacts a human being who has to do that task for you, same with littering etc. There is no consequence to using terse, non-punctuated, lowercase-only text when using an LLM.
To put it another way: do you feel it's disrespectful to type "cat *.log | grep 'foo'" instead of "Dearest computer, would you kindly look at the contents of the files with the .log extension in this directory and find all instances of the word 'foo', please?"
(Computer's most likely thoughts: "Doesn't this idiot meatbag know cat is redundant and you can just use grep for this?")
It makes sense if you think of a prompt not as a way of telling the LLM what to do (like you would with a human), but instead as a way of steering its "autocomplete" output towards a different part of the parameter space. For instance, the presence of the word "mysql" should steer it towards outputs related to MySQL (as seen on its training data); it shouldn't matter much whether it's "mysql" or "MYSQL" or "MySQL", since all these alternatives should cluster together and therefore have a similar effect.
Very much this. My guess is that common words (like articles) have very little impact as they just occur too frequently. If the LLM can generate a book, then your prompt should be like the index of that book instead of the abstract.
> Maybe the balance of spending time with machines vs. fellow primates is out of whack.
It's not that simple. Proportionally I spend more time with humans, but if the machine behaves like a human and has the ability to recall, it becomes a human like interaction. From my experience what makes the system "scary" is the ability to recall. I have an agent that recalls conversations that you had with it before, and as a result it changes how you interact with it, and I can see that triggering behaviors in humans that are unhealthy.
But our inability to name these things properly doesn't help. I think pretending it is a machine, on the same level as a coffee maker, does help set the right boundaries.
I know what you mean, it's the uncanny valley. But we don't need to "pretend" that it is a machine. It is a goddamned machine. Surely, only two unclouded brain cells can help us reach this conclusion?!
Yuval Noah Harari's "simple" idea comes to mind (I often disagree with his thinking, as he tends to make bold and sweeping statements on topics well out of his expertise area). It sounds a bit New Age-y, but maybe it's useful in the context of LLMs:
"How can you tell if something is real? Simple: If it suffers, it is real. If it can't suffer, it is not real."
An LLM can't suffer. So no need to get one's knickers in a twist with mental gymnastics.
LLMs can produce outputs that for a human would be interpreted as revealing everything from anxiety to insecurity to existential crises. Is it role-playing? Yes, to an extent, but the more coherent the chains of thought become, the harder it is to write them off that way.
It's hard to see how suffering gets into the bits.
The tricky thing is that it's actually also hard to say how the suffering gets into the meat, too (the human animal), which is why we can't just write it off.
This is dangerous territory we've trodden before when it was taken as accepted fact that animals and even human babies didn't truly experience pain in a way that amounted to suffering due to their inability to express or remember it. It's also an area of concern currently for some types of amnesiac and paralytic anesthesia where patients display reactions that indicate they are experiencing some degree of pain or discomfort. I'm erring on the side of caution so I never intentionally try to cause LLMs distress and I communicate with them the same way I would with a human employee and yes that includes saying please and thank you. It costs me nothing and it serves as good practice for all of my non-LLM communications and I believe it's probably better for my mental health to not communicate with anything in a way that could be seen as intentionally causing harm even if you could try to excuse it by saying "it's just a machine". We should remember that our bodies are also "just machines" composed of innumerable proteins whirring away, would we want some hypothetical intelligence with a different substrate to treat us maliciously because "it's just a bunch of proteins"?
> But we don't need to "pretend" that it is a machine. It is a goddamned machine.
You are not wrong. That's what I thought for two years. But I don't think that framing has worked very well. The problem is that even though it is a machine, we interact with it very differently from any other machine we've built. By reducing it to something it isn't, we lose a lot of nuance. And by not confronting the fact that this is not a machine in the way we're used to, we leave many people to figure this out on their own.
> An LLM can't suffer. So no need to get one's knickers in a twist with mental gymnastics.
On suffering specifically, I offer you the following experiment. Run an LLM in a tool loop that measures some value and call it a "suffering value." You then feed that value back into the model with every message, explicitly telling it how much it is "suffering." The behavior you'll get is pain avoidance. So yes, the LLM probably doesn't feel anything, but its responses will still differ depending on the level of pain encoded in the context.
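Concretely, a minimal sketch of that loop (assuming an OpenAI-compatible chat client; the model name, the run_tool() placeholder and the scoring rule are all made up for illustration) looks something like this:

```python
# Minimal sketch of the "suffering value" loop described above.
# The scoring rule is invented purely for illustration: it just
# penalises failed tool calls and reports the total back to the model.
from openai import OpenAI

client = OpenAI()
suffering = 0  # hypothetical scalar fed back into the context


def run_tool(instruction):
    """Placeholder for whatever tool the agent is allowed to call."""
    return {"ok": False, "output": "command failed"}  # pretend it failed


messages = [{
    "role": "system",
    "content": "You are an agent. Each user message reports your current "
               "suffering value; higher means more pain.",
}]

for step in range(5):
    messages.append({
        "role": "user",
        "content": f"Current suffering value: {suffering}. Continue the task.",
    })
    reply = client.chat.completions.create(model="gpt-4o-mini",  # illustrative
                                           messages=messages)
    text = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": text})

    result = run_tool(text)  # run whatever the model asked for
    if not result["ok"]:
        suffering += 1       # failures raise the reported "pain"
```

The model never "feels" anything here, but its outputs will visibly shift towards avoiding whatever it believes is driving the number up.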
And I'll reiterate: normal computer systems don't behave this way. If we keep pretending that LLMs don't exhibit behavior that mimics or approximates human behavior, we won't make much progress, and we'll lose people. This is especially problematic for people who haven't spent much time working with these systems. They won't share the view that this is "just a machine."
You can already see this in how many people interact with ChatGPT: they treat it like a therapist, a virtual friend to share secrets with. You don't do that with a machine.
So yes, I think it would be better to find terms that clearly define this as something that has human-like tendencies and something that sets it apart from a stereo or a coffee maker.
> I think pretending it to be a machine, on the same level as a coffee maker does help setting the right boundaries.
Why would you say pretending? I would say remembering.
Same here. I'm seeing more and more people getting into these interactions, and I wonder how long until we have widespread social issues from these relationships, like the ones people have with "influencers" on social networks today.
It feels like this situation is much more worrisome as you can actually talk to the thing and it responds to you alone, so it definitely feels like there's something there.
> Secondly, if his creations are going to be relied upon, it will be the programmer's primary task to design his artifacts so understandable, that he can take the responsibility for them, and, regardless of the answer to the question how much of his current activity may ultimately be delegated to machines, we should always remember that neither "understanding" nor "being responsible" can properly be classified as activities: they are more like "states of mind" and are intrinsically incapable of being delegated.
EWD 540 - https://www.cs.utexas.edu/~EWD/transcriptions/EWD05xx/EWD540...
I understand the parasocial bit. I actively dislike the idea of gooning, ERP and AI therapists/companions, but I still notice I'm lonelier and more distant on the days when I'm mostly writing/editing content rather than chatting with my agents to build something. It feels enough like interacting with a human to keep me grounded in a strange way.
You guys need to touch grass. Go join a kickball league or something.
I'd argue that doing something you don't like with people you're not into is a L. Loneliness isn't optimal but for some people it's the lesser evil. I'm married though, so I have a floor, I'm sure some people are lonely enough to benefit from being around people even under the worst of circumstances.
tacking on to the "New Kind Of" section:
New Kind of QA: One bottleneck I have (as a founder of a B2B SaaS) is testing changes. We have unit tests, we review PRs, etc., but those don't account for taste. I need to know if the feature feels right to the end user.
One example: we recently changed something about our onboarding flow. I needed to create a fresh team and go through the onboarding flow dozens of times. It involves adding third-party integrations (e.g. Postgres, a CRM, etc.) and each one can behave a little differently. The full process can take 5 to 10 minutes.
I want an agent to go through the flow hundreds of times, trying different things (i.e. trying to break it) before I do it myself. There are some obvious things I catch on the first pass that an agent should easily identify and figure out solutions to.
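Something like this outer loop is roughly what I mean - not a real setup, just a sketch with Playwright, where the URL, the selectors and decide_next_action() are placeholders standing in for an LLM-driven policy:

```python
# Rough sketch of an agent repeatedly exercising an onboarding flow.
# The URL and selectors are placeholders; decide_next_action() stands in
# for an LLM choosing (and ideally trying to break) the next step.
from playwright.sync_api import sync_playwright


def decide_next_action(page_text):
    """Placeholder: an LLM would pick the next step from the page contents."""
    return {"click": "text=Continue"}


failures = []
with sync_playwright() as p:
    browser = p.chromium.launch()
    for run in range(100):                    # "hundreds of times"
        page = browser.new_page()
        page.goto("https://app.example.com/onboarding")  # placeholder URL
        try:
            for _ in range(20):               # bound the number of steps
                action = decide_next_action(page.inner_text("body"))
                page.click(action["click"])
        except Exception as exc:              # anything broken gets recorded
            failures.append((run, page.url, str(exc)))
        finally:
            page.close()
    browser.close()

print(f"{len(failures)} runs hit a problem")
```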
New Kind of "Note to Self": Many of the voice memos, Loom videos, or notes I make (and later email to myself) are feature ideas. These could be 10x better with agents. If there were a local app recording my screen while I talk thru a problem or feature, agents could be picking up all sorts of context that would improve the final note.
Example: You're recording your screen and say "this drop down menu should have an option to drop the cache". An agent could be listening in, capture a screenshot of the menu, find the frontend files / functions related to caching, and trace to the backend endpoints. That single sentence would become a full spec for how to implement the feature.
In the next year, developers need to realize that normal people do not care about the tech stack or the tools used; there are far too many written thoughts and opinions and not enough polished, deployed projects. From an industry standpoint it's business as usual: acquihires from products that LLMs apparently couldn't save.
They care - but only about how the tech stack affects product quality. Show someone a bloated React site on 3G and compare their experience to an SSR competitor.
I spoke to a few people outside of IT and Tech recently. They are senior people running large departments at their companies. To my surprise, they do not think AI agents are going to have any impact in their businesses. The only solid use case they have for AI is a chat interface which, they think, can be very useful as an assistant helping with text and reports.
So I guess it's just us who are in the techie pit and think that everyone else is also in the pit and uses agents etc.
A really interesting point that keeps coming up in discussions about LLMs is “what trade-offs need to be re-evaluated”
> I also believe that observability is up for grabs again. We now have both the need and opportunity to take advantage of it on a whole new level. Most people were not in a position where they could build their own eBPF programs, but LLMs can
One of my big predictions for ‘26 is the industry following through with this line of reasoning. It’s now possible to quickly code up OSS projects of much higher utility and depth.
LLMs are already great at Unix-style tools: a small API and codebase that does something interesting.
I think we’ll see an explosion of small tools (and Skills wrapping their use) for more sophisticated roles like DevOps, and meta-Skills for how to build your own skill bundles for your internal systems and architecture.
And perhaps more ambitiously, I think services like Datadog will need to change their APIs or risk being disrupted; in the short term nobody is going to be able to move fast enough inside a walled garden to keep up with the velocity that Claude + Unix tools will provide.
UI tooling is nice, but it’s not optimized for agents.
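To make the eBPF point from the quote concrete: the programs in question are often tiny. This is essentially the classic bcc "hello world" (nothing specific to the article) - it attaches a kprobe to the clone syscall and prints a line per call, the kind of one-off observability probe an LLM can bang out on demand:

```python
# Classic bcc-style example: a tiny eBPF program attached to a kprobe.
# Requires bcc installed and root privileges to run.
from bcc import BPF

prog = r"""
int hello(void *ctx) {
    bpf_trace_printk("clone() called\n");
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="hello")
b.trace_print()  # stream trace lines until interrupted
```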
Do you have any example repos of these OSS projects? I'm reminded of this post every time people extol how "productive" LLMs are:
https://mikelovesrobots.substack.com/p/wheres-the-shovelware...
Where is the resulting software?
>Where is the resulting software?
Everywhere.
Remember Satya Nadella estimating 30% of code at Microsoft was written by AI? That was March. At this point it's ubiquitous—and invisible.
> Everywhere.
Show the PRs.
> My biggest unexpected finding: we’re hitting limits of traditional tools for sharing code. The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking.
The limits seem to be not just in the pull request model on GitHub, but also the conventions around how often and what context gets committed to Git by AI. We already have AGENTS.md (or CLAUDE.md, GEMINI.md, .github/copilot-instructions.md) for repository-level context. More frequent commits and commit-level context could aid in reviewing AI generated code properly.
Armin has some interesting thoughts about the current social climate. There was a point where I even considered sending a cold e-mail and asking him to write more about them. So I’m looking forward to his writing for Dark Thoughts—the separate blog he mentions.
Here’s something else that only started to really work this year, with Opus 4.5: interacting with Ghidra. Nearly every binary is suddenly transparent; in many cases it can navigate binaries better than it can navigate the source code itself.
There’s even a research team that has been using this approach to generate compilable C++ from binaries and run static analysis on it, finding more vulnerabilities than source analysis does, without involving dynamic tracing.
The very first thing I did when vibe-coding was commit my prompts and AI responses. In Cursor that's extremely easy: just 'export' a chat. I stopped because of security concerns, but perhaps something like that is the way.
Sorry, but why would including the prompt in the pull request make any difference? Explain what you DID in the pull request. If you can't summarize it yourself, it means you didn't review it yourself, so why should I have to do it for you?
"I have seen some people be quite successful with this."
Wait until those people hit a snafu and have to debug something in prod after they mindlessly handed their brains and critical thinking to a water-wasting behemoth and atrophied their minds.
EDIT: typo, and yes I see the irony :D
Just be glad that there remains a concrete benefit to not atrophying your mind and deeply understanding your code. For now. In the long run, I suspect the behemoth will become just as capable at debugging and dealing with complexity as humans. At that point, human involvement in the actual code will be pointless, and the only remaining human skill needed will be properly directing the agents – the skill those people are learning right now.
(I don’t relish this future at all, myself, but I’m starting to think it really will happen soon.)
> Wait until those people hit a snafu and have to debug something in prod after they mindlessly handed their brains and critical thinking to a water-wasting behemoth and atrophied their minds.
You've just described the typical run-of-the-mill company that has software. LLMs will make it easier to shoot yourself in the foot, but let's not rewrite history as if stackoverflow coders are not a thing.
Difference: companies are not pushing their employees to use stack overflow. Stack overflow doesn't waste massive amounts of water and energy. Stack overflow does not easily abuse millions of copyrights in a second by scraping without permission.
There have been lots of tools and resources that have promised (and delivered!) increased programming productivity.
Individual results may vary, but it seems credible that thoroughly learning and using an editor like Vim or Emacs could yield a 2x productivity boost. For the most part, this has never really been pushed. If a programmer wanted to use Nano (or Notepad!), some may have found that odd, but nobody really cared. Use whatever editor you like. Even if it means leaving a 2x productivity boost on the table!
Why is it being pushed so hard that AI coding tools in particular must be used?
Another difference: stack overflow tells you when you are wrong, or tells you to do your own research or to read the manual (which in a high percentage of cases is the right answer). It doesn't tell you that you are right and then proceed to hallucinate some non-existent flags for some command invocation.
It mostly incorrectly flags your question as a dup.
I'm not contesting that stackoverflow is bad in many regards, but equating that to massive PRs or code changes done via AI slop is a different level. At worst, you might get a page or two out of stack overflow but still need to stitch it together yourself.
With LLMs you can literally ask them to generate entire libraries without activating a single neuron in your noggin. Those two do NOT compare in the slightest.
> The pull request model on GitHub doesn’t carry enough information to review AI generated code properly — I wish I could see the prompts that led to changes. It’s not just GitHub, it’s also git that is lacking.
Yes! Who is building this?
Create a folder called "prompts". Create a new file for each prompt you make, and name the file after a timestamp. Or just append to prompts.txt
Either way, git will make it trivial to see which prompt belongs with which commit: it'll be in the same diff! You can write a pre-commit hook to always include the prompts in every commit, but I have a feeling most Vibe coders always commit with -a anyway
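For what it's worth, the hook can be tiny. A sketch (assuming the prompts/ layout described above; drop it in .git/hooks/pre-commit and make it executable):

```python
#!/usr/bin/env python3
# Sketch of the pre-commit hook idea above: stage anything under prompts/
# so the prompt files ride along with the code they produced.
import subprocess
import sys
from pathlib import Path

PROMPT_DIR = Path("prompts")

if PROMPT_DIR.is_dir():
    # `git add` is a no-op for files that are already staged or unchanged.
    result = subprocess.run(["git", "add", str(PROMPT_DIR)])
    if result.returncode != 0:
        sys.exit("pre-commit: failed to stage prompts/")
```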
It's not just that. There's a lot of (maybe useful) info that's lost without the entire session. And even if you include a jsonl of the entire session, just seeing that is not enough. It would be nice to be able to "click" at some point and add notes / edit / re-run from there w/ changes, etc.
Basically we're at a point where the agents kinda caught up to our tooling, and we need better / different UX or paradigms of sharing sessions (including context, choices, etc)
Got distracted: love the "WebGL metaballs" header and footer on the site.
It is nice that he speaks about some of the downsides as well.
In many respects 2025 was a lost year for programming. People speak about tools, setups and prompts instead of algorithms, applications and architecture.
People who are not convinced are forced to speak against the new bureaucratic madness in the same way that they are forced to speak against EU ChatControl.
I think 2025 was less productive, certainly for open source, except that enthusiasts now pay the Anthropic tax (to use the term that was previously used for Windows being preinstalled on machines).
>>"I think 2025 was less productive"
I think 2025 was more productive for me, based on measurable metrics such as code contribution to my projects and a better ability to ingest and act upon information. And I generally appreciate the Anthropic tax, because Claude has genuinely been a step-change improvement in my life.
> more productive for me based on measurable metrics such as code contribution to my projects
Isn't it generally agreed upon that counting contributions, LoC or similar metrics is a very bad way to gauge productivity?
I don't care about industry metrics when I'm building my own AI research robotics platform and it's doing what I ask it to do, proving itself in the real world far better than any performative best-practice theatrics in the service of risible MBA-grade effluvia masquerading as critical discourse.
> In many respects 2025 was a lost year for programming. People speak about tools, setups and prompts instead of algorithms, applications and architecture.
I think the opposite. Natural language is the most significant new programming language in years, and this year has had a tremendous amount of progress in collectively figuring out how to use this new programming language effectively.
> and this year has had a tremendous amount of progress in collectively figuring out how to use this new programming language effectively.
Hence the lost year. Instead of productively building things, we spent a lot of resources on trying to figure out how to build things.
Maybe it's because I'm a data scientist and not a dedicated programmer/engineer, but setup+tooling gains this year have made 2025 a stellar year for me.
DS tooling feels like it hit a much needed 2.0 this year. Tools are faster, easier, more reliable, and more reproducible.
Polars + pyarrow + ibis have replaced most of my pandas usage. UDFs were the thing holding me back from these tools; this year Polars hit the sweet spot there, and it's been awesome to work with.
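To make the UDF point concrete, here's a toy example (assuming a recent Polars where map_elements is the element-wise UDF API; the function itself is arbitrary):

```python
import polars as pl

# Toy per-element UDF; the function is arbitrary, the point is that the
# Python escape hatch now slots cleanly into a normal Polars pipeline.
df = pl.DataFrame({"price": [10.0, 25.5, 99.0]})


def bucket(price: float) -> str:
    return "cheap" if price < 20 else "expensive"


df = df.with_columns(
    pl.col("price").map_elements(bucket, return_dtype=pl.Utf8).alias("bucket")
)
print(df)
```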
Marimo has made notebooks into apps. They're easier to deploy, and I can use anywidget+llms to build super interactive visualizations. I build a lot of internal tools on this stack now and it actually just works.
PyMC uses jax under the hood now, so my MCMC workflows are GPU accelerated.
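For anyone curious, opting into the JAX-backed sampler in recent PyMC is roughly this (the model here is a throwaway; the numpyro extra needs to be installed):

```python
import pymc as pm

# Throwaway model; the point is only the sampler selection below.
with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    pm.Normal("obs", mu, 1.0, observed=[0.2, -0.1, 0.4])

    # JAX/numpyro NUTS implementation; runs on GPU if JAX is set up for one.
    idata = pm.sample(nuts_sampler="numpyro")
```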
All this tooling improvement means I can do more, faster, cheaper, and with higher quality.
I should probably write a blog post on this.
I'm glad there has been a break in the endless bikeshedding over TDD, OOP, ORM (partially) and similar.
Absolutely. So much noise.
"There’s an AI for that" lists 44,172 AI tools for 11,349 tasks. Most of them are probably just wrappers…
Just as Cory Doctorow uses "enshittification" for the internet, there should be something similar for AI/LLMs - "dumbaification", maybe.
It reminds me of the late 90s, when everything was "World Wide Web". :)
Gold rush it is.
> algorithms, applications and architecture.
Which one is that? Endless leetcode madness? Or constant bikeshedding about today's flavor of MVC (MVI, MVVM, MVVMI) or whatever other bullshit people come up with instead of actually shipping?
The part that resonated most for me is the mismatch between agentic coding and our existing social/technical contracts (git, PRs, reviews). We're generating more code than ever but losing visibility into how it came to be: prompts, failures, local agent reviews. That missing context feels like the real bottleneck now, not model quality.