I've come to view LLMs as a consulting firm where, for each request, I have a 50% chance of getting either an expert or an intern writing my code, and there's no way to tell which.
Sometimes I accept this, and I vibe-code, when I don't care about the result. When I do care about the result, I have to read every line myself. Since reading code is harder than writing it, this takes longer, but LLMs have made me too lazy to write code now, so that's probably the only alternative that works.
I have to say, though, the best thing I've tried is Cursor's autocomplete, which writes 3-4 lines for you. That way, I can easily verify that the code does what I want, while still reaping the benefit of not having to look up all the APIs and function signatures.
I've had a similar experience. I have become too lazy since I started vibe-coding; my role has shifted from coder to code reviewer/fixer very quickly. Overall I feel it's a good thing, because the last few years of my life have been a repetition of frontend components and API endpoints, which had become too monotonous, so I am happy to have AI take over that grunt work while I supervise.
Yeah, exactly the same for me. It's tiring writing the same CRUD endpoints a thousand times, but that's how useful products are made.
I wonder why it’s not the norm to use code generation or some other form of meta programming to handle this boring repetitive work?
Leaky abstractions. Lots of metaprogramming frameworks have tried to do this over the years (factor out as much CRUD as possible), but it always ends up that there is some edge case your unique program needs that isn't handled, and then it's a mess trying to hack the metaprogramming layer to add what you need. Think of all the hundreds of frameworks that try to add an automatic REST API on top of a database table: then you need permissions, domain-specific logic, special views, etc., and it ends up being easier to just write it yourself.
If you imagine an evolutionary function oscillating over time between no abstraction and total abstraction, the current batch of frameworks like Django is roughly the local maximum that was settled on: enough to do what you need, but not so much that it becomes hard to customize for your use case.
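To make the "leaky" part concrete, here is a toy sketch of the pattern, assuming Flask and an in-memory dict standing in for the database table (all names are made up for the example): a helper hands you CRUD for free, and the first real requirement already falls outside what it can generate.

    # Toy illustration only: auto-generated CRUD plus the hand-written escape hatch.
    from flask import Flask, abort, jsonify, request

    app = Flask(__name__)

    def register_crud(app, name, store):
        """Auto-generate list/get/create endpoints for a 'table' (a plain dict here)."""
        def list_items():
            return jsonify(list(store.values()))

        def get_item(item_id):
            return jsonify(store[item_id]) if item_id in store else abort(404)

        def create_item():
            item = request.get_json()
            store[item["id"]] = item
            return jsonify(item), 201

        app.add_url_rule(f"/{name}", f"{name}_list", list_items, methods=["GET"])
        app.add_url_rule(f"/{name}/<int:item_id>", f"{name}_get", get_item, methods=["GET"])
        app.add_url_rule(f"/{name}", f"{name}_create", create_item, methods=["POST"])

    invoices = {}
    register_crud(app, "invoices", invoices)   # free CRUD, until...

    # ...the edge cases arrive: permissions, domain rules, special views. None of
    # them fit the generator, so you end up writing the route by hand anyway.
    @app.route("/invoices/<int:item_id>/approve", methods=["POST"])
    def approve_invoice(item_id):
        invoice = invoices.get(item_id) or abort(404)
        if request.headers.get("X-Role") != "manager":   # permissions
            abort(403)
        if invoice.get("total", 0) > 10_000:             # domain-specific logic
            abort(422, "needs a second approver")
        invoice["status"] = "approved"
        return jsonify(invoice)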
Because, just as a carpenter doesn't always make the same table but can still be tired of always making tables, I don't always write the exact same CRUD endpoints, but I am tired of always writing CRUD endpoints.
The lazy reluctance you feel is atrophy in the making. LLMs induce that.
> Since reading code is harder than writing it,
Reading bad code is harder than writing bad code. Reading good code is easier than writing good code.
I beg to differ.
No need to beg. Everyone’s got their opinion. I just wish, this being Hacker News, that more people would articulate their different opinions instead of just stopping with “I disagree.”
Well, my first comment said "reading code is harder than writing code", your comment said "reading good code is easier than writing good code". I believe the two points are about equally articulated.
Why is reading code harder than writing it?
The multi-line autocomplete feels like the right middle ground (for me) when working in an existing code base with well established patterns. Adding new functionality is mostly a matter of scaffolding, commenting as I go, typing the first few characters of a code block and tabbing my way through the bulk of the code.
>When I do care about the result, I have to read every line myself.
Isn't that the same as delegating a task to a junior developer, where you still have to check their work as the senior?
It is, but not the same as if a senior developer were writing it. I would feel much less like I have to check it then.
I still have doubts, but I am warming up to Claude Code. Sometimes it gets into these ruts where it tries something wrong, gets shown it's wrong, keeps digging, and can't even explain how its failed suggestions could ever work. Other times, it seems to reproduce the kinds of insights (and even judgment) that I'd expect from an actual practitioner. After it suggests code to fix a problem, I always ask it to explain the context in more detail. I was recently using it to help develop a Qt C++ application, and ended up getting this interaction from it:
> Can you speculate on why this problem was only happening with these embedded buttons? Is this a flaw in Qt, or is this application trying to do something unusual?
Why this problem occurred: This is a Qt design issue, not a flaw per se. Qt's item delegate editing model assumes editors are simple widgets (line edits, spin boxes, combo boxes) that don't spawn modal dialogs. When a modal dialog opens:

1. The modal dialog steals focus from the editor button
2. Qt's QAbstractItemView sees the FocusOut event and assumes editing is complete
3. It calls QStyledItemDelegate::destroyEditor() to clean up the editor
4. The button gets deleted while the color dialog is still running
5. When the color dialog closes, it tries to interact with the deleted button

This application is doing something unusual - using a button that spawns a modal dialog as a table cell editor. Most Qt delegates use immediate editing widgets (text fields, dropdowns) that don't require modal interactions.

The eventFilter solution works by intercepting the FocusOut event that triggers editor destruction, essentially telling Qt "I'm still editing, don't destroy me yet."

> Is there a danger of leaked Qt resources now that we are blocking destruction?

Yes, that's a valid concern. By blocking FocusOut events, we're preventing Qt's normal editor cleanup mechanism. However, the risk is likely minimal because:

1. Qt still cleans up eventually - The editor will be destroyed when the table view itself is destroyed or when editing truly ends (e.g., clicking elsewhere, pressing Enter/Escape)
2. Short-lived editors - These editor widgets are typically very short-lived (created when editing starts, destroyed when editing ends)
3. Parent widget cleanup - Since the button has a parent widget, Qt's parent-child ownership will clean it up when the parent is destroyed

To be completely safe, we could add a more targeted approach - only block FocusOut events while a color dialog is actively open, rather than blocking all FocusOut events. We could track when the modal dialog is shown/hidden and only filter during that window.

But for this use case, the current solution should be fine since:

- Color picking is infrequent
- The editor lifetime is still bounded by the table view's normal lifecycle
- Qt's parent-child cleanup provides a safety net

The memory usage impact should be negligible compared to the complexity of implementing more sophisticated tracking.

-----

I've definitely had less-satisfying discussions over bugs with actual human senior software engineers than this.
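(For concreteness, the "more targeted approach" it describes amounts to something like the sketch below - written here in PySide6 rather than the app's actual C++, with illustrative names. The idea is simply to swallow the editor's FocusOut only while the modal colour dialog is open.)

    from PySide6.QtCore import QEvent
    from PySide6.QtGui import QColor
    from PySide6.QtWidgets import QColorDialog, QPushButton, QStyledItemDelegate

    class ColorButtonDelegate(QStyledItemDelegate):
        """Table-cell editor that is a button spawning a modal QColorDialog."""

        def __init__(self, parent=None):
            super().__init__(parent)
            self._dialog_open = False

        def createEditor(self, parent, option, index):
            button = QPushButton("Pick colour...", parent)
            button.clicked.connect(lambda: self._pick(button))
            return button

        def _pick(self, button):
            self._dialog_open = True          # the modal dialog is about to steal focus
            QColorDialog.getColor(QColor("white"), button)
            self._dialog_open = False

        def eventFilter(self, editor, event):
            # Qt normally treats the editor's FocusOut as "editing finished" and
            # destroys the editor. Swallow it only while the dialog is actually open.
            if event.type() == QEvent.Type.FocusOut and self._dialog_open:
                return True
            return super().eventFilter(editor, event)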
It seems to be just playing the “yes and” improv game with you. You might want to also try prompting it against the same suggestions and see if it changes to follow your lead or continues to hold the original opinion.
I believe choosing a well-known problem space in a well-known language certainly influenced a lot of the behavior. An AI's usefulness correlates strongly with its training data, and there has no doubt been a significant amount of data about both the problem space and Python.
I’d love to see how this compares when either the problem space is different or the language/ecosystem is different.
It was a great read regardless!
One of my test queries for AI models is to ask it for an 8 bit asm function to do something that was invented recently enough that there is unlikely to be an implementation yet.
Multiplying two 24-bit posits on 8-bit AVR, for instance. No models have succeeded yet, usually because they try to put more than 8 bits into a register. Algorithmically they seem to be on the right track, but they don't seem to be able to hold onto the idea that registers are only 8 bits wide through the entirety of their response.
Do you provide this context or just ask the model to one-shot the problem?
A clear description of the problem, but one-shot.
Something along the lines of
Can you generate 8-bit AVR assembly code to multiply two 24 bit posit numbers
You get some pretty funny results from the models that have no idea what a posit is. It's usually pretty easy to tell whether they know what they are supposed to be doing. I haven't had a success yet (though I haven't tried for a while). Some of them have come pretty close, but usually it's trying to squeeze more than 8 bits of data into a register that brings them down.
Yeah, so it'd be interesting to see whether, provided the correct context and your understanding of its error pattern, it can accomplish this.
One thing you learn quickly about working with LLMs is that they have these kinds of baked-in biases, some of which are very fixed and tied to their very limited ability to engage in novel reasoning (cc François Chollet), while others are far more loosely held and correctable. If it sticks with the errant pattern even when provided the proper context, it probably isn't something an off-the-shelf model can handle.
100% this. I tried Haskelling with LLMs and its performance is worse compared to Go.
Although in fairness this was a year ago on GPT 3.5 IIRC
> Although in fairness this was a year ago on GPT 3.5 IIRC
GPT-3.5 was impressive at the time, but today's SOTA models (like GPT-5 Pro) are almost night-and-day different, both in terms of producing better code for a wider range of languages (I mostly do Rust and Clojure; it handles those fine now, but was awful with 3.5) and, more importantly, in terms of following your instructions in user/system prompts. It's easier to get higher-quality code from it now, as long as you can put into words what "higher quality code" means for you.
I write Haskell with Claude Code and it's gotten remarkably good recently. We have some code at work that uses STM to implement what is essentially a mutable state machine. I needed to split a state transition apart, and it did an admirable job. I had to intervene once or twice when it was going down a valid but undesirable approach. This almost-one-shot performance was already a productivity boost, but the result didn't quite build. What I find most impressive now is that the "fix" here is to literally have Claude run the build and see the errors. While GHC errors are verbose and not always the best, it got everything building in a few more iterations. When it later hit a test failure, I suggested we add a bit more logging - so it logged all state transitions, spotted the unexpected transition, and got the test passing. We really are a LONG way from 3.5 performance.
I'm not sure I'd say "100% this" if I was talking about GPT 3.5...
Yeah, 3.5 was good when it came out but frankly anyone reviewing AI for coding not using sonnet 4.1, GPT-5 or equivalent is really not aware of what they've missed out on.
Yah, that’s a fair point. I had assumed it’d remain relatively similar given that the training data would be smaller for languages like Haskell versus languages like Python & JavaScript.
Post-training in all frontier models has improved significantly wrt programming language support. Take Elixir, which LLMs could barely handle a year ago, but now support has gotten really good.
3.5 was a joke in coding compared to sonnet 4.
It's so thrilling that this is actually true in just a year
Yup fair point, it’s been some time. Although vibe coding is more “miss” than “hit” for me.
I wrote some Haskell using Claude. It was great.
I've had a lot of good luck with Julia, on high performance data pipelines.
Write a blog post about this! Would love to read it.
ChatGPT is pretty useless at Prolog IME
Great article, though I'm still reading it as it's a mammoth read!
A side note: as it's been painfully pointed out to me, "vibe coding" means not reading the code (ever!). We need a term for coding with LLMs exclusively, but also reviewing the code they output at each step.
We could revive the old CASE acronym (https://en.wikipedia.org/wiki/Computer-aided_software_engine...). ;)
SWITCH: SoftWare Implementation Through Computer & Human
BASE: Brain And Silicon Engineering
CLASS: Computer/Llm-Assisted Software Specification
STRUCT: Scripting Through Recurrent User/Computer Teamup
ELSE: Electronically Leveraged Software Engineering
VOID: Very Obvious Intelligence Deficit
Okay maybe not that last one
Actually a good suggestion.
Prediction: arguments over the definition will ensue
It's called "reviewing code." I'm not taking any kind of responsibility for code that I haven't written myself.
You're not just hitting go and reviewing code though. If someone asked how I built a side project and I said "reviewing code" it would make no sense.
> If someone asked how I built a side project

Then you might have to say the truth: that you didn't build it, but Claude/OpenAI/Gemini built it under your supervision.
Now we're getting somewhere. I think there's more than just supervision involved though. The ideas, direction and design are also provided by the person driving and reviewing output from the agents.
I use "Pro-coding" as it implies professionalism or process, or at least some sort of formality.
It doesn't imply AI, but I don't distinguish between AI-assisted and pre-AI coding, just vibe-coding, as I think that's the important demarcation now.
Prompt coding or just prompting
"Lets prompt up a new microservice for this"
"What have you been prompting lately?"
"Looking at commits, prompt coding is now 50% of your output. Have a raise"
What is the term for getting the ick from reading?
Just use "coding", then let's reserve the word "programming" for Linus.
If I was told I'd be working with a fellow programmer who would make all the mistakes listed in Section 5 of the article, I'd have to say "no thanks". Yet the author ends with "I don’t think I will ever code again without the assistance of an AI model". He's a lot more thick-skinned than I.
Some people also value programs for their productive ends rather than value them for the process of writing them in a pleasing way. Personally, I've been getting more done than ever with Claude Code. That I am able to work just a few minutes at a time then let the machine go is really nice as a parent. For those of us who don't program for a day job, but need programs for our day job, Claude and friends have completely changed what's possible.
What would you expect from "AI guy vibing AI code for AI application"? Marco warned you about the "AI echo chamber" from the outset - and he kept his promise :-)
What stands out for me is that it was all possible thanks to the fact that the AI operator/conversationalist had enough knowledge to write it all by hand, more or less, if he chose to.
It has probably been said many times already, but the competition will be between programmers with AI and programmers without it, rather than between programmers and AI with no programmers at all.
In particular, I love this part:
"I had serious doubts about the feasibility and efficiency of using inherently ambiguous natural languages as (indirect) programming tools, with a machine in between doing all the interpretation and translation toward artificial languages endowed with strict formal semantics. No more doubts: LLM-based AI coding assistants are extremely useful, incredibly powerful, and genuinely energising.
But they are fully useful and safe only if you know what you are doing and are able to check and (re)direct what they might be doing — or have been doing unbeknownst to you. You can trust them if you can trust yourself."
Basically, at my workplace we have a coding agent in a while loop.
What it does is pretty simple. You give it a problem and set up an environment with libraries and all.
It continuously makes changes to the program, then checks its output.
And iteratively improves it.
For example, we used it to build a new method for applying diffs generated by LLMs to files.
Since different models are good at different things, we managed to run it against several models to figure out which method performs best.
Can a human do it? I doubt it.
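(Not their code, obviously, but a minimal sketch of what such a loop can look like; propose_change and apply_change are hypothetical stand-ins for the LLM call and the diff-application method being tested.)

    import subprocess

    def run_checks():
        # Whatever validates the current repo state: tests, a linter, a benchmark...
        return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

    def agent_loop(task, propose_change, apply_change, max_iters=20):
        """Give the agent a problem, let it edit, check the output, feed errors back, repeat."""
        history = [task]
        for _ in range(max_iters):
            patch = propose_change(history)    # LLM call (stand-in)
            apply_change(patch)                # write the proposed change to disk (stand-in)
            result = run_checks()
            if result.returncode == 0:
                return patch                   # checks pass: stop iterating
            history.append("Checks failed:\n" + result.stdout + result.stderr)
        raise RuntimeError("no passing change within the iteration budget")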
What kind of problems have you been throwing at it?
> English as code
First time encountering the phrase.
Evolution went from Machine Code to Assembly to Low-level programming languages to High-level programming languages (with frameworks), to... plain English.
that phrase was used all the time to describe HyperTalk and AppleTalk
In the early 70's Microdata trademarked "English" as the name of their SQL-like database retrieval language.
I don't feel good doing it, but is anyone else feeling that not capitalizing text, maintaining a slightly abrasive attitude, and consciously stealing credit yields better results from coding agents? e.g. "i want xxx implemented, can you do", "ok you do" rather than "I'm wondering if..." etc.
Why not just:
"Implement xxx"
?
I don't think we can offend these things (yet).
idk, my thinking is that sweatshop-Slack-like inputs might correspond to more professional outcomes than exam-like questions, since exams would be more likely to be answered by beginners. I also fear "Implement xxx" might just be too short; I feel they might like to have some bytes to map to outputs. Could very well be placebo, as pointed out.
There is so much subjective placebo with “prompt engineering” that anyone pushing any one thing like this just shows me they haven’t used it enough yet. No offense, just seeing it everywhere.
Better results if you… tip the AI, offer it physical touch, you need to say the words “go slow and take a deep breath first”…
It’s a subjective system without control testing. Humans are definitely going to apply religion, dogma, and ritual to it.
The best research I've seen on this is:
- Threatening or tipping a model generally has no significant effect on benchmark performance.
- Prompt variations can significantly affect performance on a per-question level. However, it is hard to know in advance whether a particular prompting approach will help or harm the LLM's ability to answer any particular question.
https://arxiv.org/abs/2508.00614
That 100% tracks with expectations if your technical knowledge extends past "believer".
Now… for fun, look up "best prompting" or "the perfect prompt" on YouTube. Thousands of video "tips" and "expert recommendations" that border on the arcane.
> Better results if you… tip the AI, offer it physical touch, you need to say the words “go slow and take a deep breath first”…
I'm not saying I've proven it or anything, but it doesn't sound far-fetched that a thing that generates new text based on previous text, would be affected by the previous text, even minor details like using ALL CAPS or just lowercase, since those are different tokens for the LLM.
I've noticed the same thing with what exact words you use. State a problem as a lay/random person, using none of the domain words for things, and you get a worse response compared to if you used industry jargon. It kind of makes sense to me considering how they work internally, but happy to be proven otherwise if you're sitting on evidence either way :)
We all agree that outputs are affected by the tokens in the prompt.
The issue is that you can't know whether you are affecting them positively or negatively, because there is no real control.
And the effect could switch between prompts.
I tell my agent to off itself every couple of hours; it's definitely placebo, as you're just introducing noise, which might or might not be good. Adding "hmm, <prompt>" has been my go-to for a while when I want to force it to give me different results, because it appears to trigger some latent regions of the LLM.
This seems to be exactly what I’m talking about though. We made a completely subjective system and now everyone has completely subjective advice about what works.
I’m not saying introducing noise isn’t a valid option, just doing it in ‘X’ or ‘y’ method as dogma is straight bullshit.
This is one of many reasons that I believe the value of current AI tech is zero if not negative.
This was a great write-up!
It looks like the methodology this chap used could become a boilerplate.
My experience exactly (including nearly 40 years of code exposure). I just wish there was an alternative to Claude Sonnet 4. I see Gemini 2.5 Pro as a side girlfriend, but only Claude truly vibes with me.
If Claude vibes with you, why do you need an alternative?
If you only have one choice, that is exactly why you need more choices. After they doubled their prices and added quotas with 5-hour pauses, I don't trust Anthropic not to pull the rug again.
This was interesting.
I still wonder, if (as the author mentions and I've seen in my experience) companies are pivoting to hiring more senior devs and fewer or no junior devs...
... where will the new generations of senior devs come from? If, as the author argues, the role of the knowledgeable senior is still needed to guide the AI and review the occasional subtle errors it produces, where will new generations of seniors be trained? Surely one cannot go from junior-to-senior (in the sense described in TFA) just by talking to the AI? Where will the intuition that something is off come from?
Another thing that worries me, though I'm willing to believe it'll get better: the reckless abandon with which AI solutions consume resources, and how completely oblivious the AI is to it, like TFA describes (3.5 GB of RAM for the easiest, 3-pillar Hanoi configuration). Every veteran computer user (not just programmers but also gamers) has been decrying for ages how software becomes more and more bloated, how hardware doesn't scale with the (mis)use of resources, etc. And I worry this kind of vibe coding will only make it horribly worse. I'm hoping some sense of resource consciousness can be included in new training datasets...
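(For scale: the textbook 3-peg solution needs next to nothing - a short recursion whose only real cost is the list of moves. A quick sketch:)

    def hanoi(n, source="A", target="C", spare="B", moves=None):
        """Classic Tower of Hanoi: O(n) recursion depth, 2^n - 1 moves, a few KB of memory."""
        if moves is None:
            moves = []
        if n > 0:
            hanoi(n - 1, source, spare, target, moves)
            moves.append((source, target))
            hanoi(n - 1, spare, target, source, moves)
        return moves

    print(len(hanoi(10)))  # 1023 moves - nowhere near gigabytes of RAM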
People keep saying this, but the young folks who start out with this stuff are gonna surpass us old folks at some point. Us old folks just get a big head start.
Right now we're comparing seniors who learned the old way to juniors who learned the old way. Soon we'll start having juniors who started out with this stuff.
It also takes time to learn how to teach people to use tools. We're all still figuring out how to use these, and I think again, more experience is a big help here. But at some point we'll start having people who not only start out with this stuff, but they get to learn from people who've figured out how to use it already.
> Soon we'll start having juniors who started out with this stuff.
But who will hire them? Businesses are ramping down from hiring juniors, since apparently a few good seniors with AI can replace them (in the minds of the people doing the hiring).
Or is it that when all of the previous batch of seniors have retired or died of old age, businesses will have no option but to hire juniors trained "the new way", without a solid background to help them understand when AI solutions are flawed or misguided, and pray it all works out?
> But who will hire them?
Anyone who wants a competitive advantage?
My claim is that the gap between junior and senior has temporarily widened, which is why someone who previously would want to hire juniors might not right now. But I expect it will narrow as a generation that learned on this stuff comes into the fold, probably to a smaller gap than existed pre-LLM.
I think it will also narrow if the tools continue to get better.
> Anyone who wants a competitive advantage?
Do you mean long-term vision? Short-term the advantage is in hiring only seniors, but do you mean companies will foresee trouble looming ahead and "waste" money on juniors just to avert this disaster?
My own feeling is that this could become like a sort of... well, I recently heard of the term "population time bomb", and it was eye-opening for me. How once it starts rolling, it's incredibly hard/impossible to revert, etc.
So what if we have some sort of "experience time bomb" here? Businesses stop hiring juniors. Seniors are needed to make AI work, but their experience isn't passed on because... who to pass it to? And then juniors won't have this wealth of "on the job experience" to be able to smell AI disaster and course-correct. The kind of experience you learn from actual work, not books.
Super long article, empty GitHub apart from the vibe stuff. I can't find any biography or affiliation.
This looks like him: https://www.bankit.art/people/marco-benedetti
I enjoyed it, but then, I trend prolix, myself.
Same