The new calculus of AI-based coding

(blog.joemag.dev)

64 points | by todsacerdoti 8 hours ago

54 comments

  • Animats 4 hours ago

    > Instead, we use an approach where a human and AI agent collaborate to produce the code changes. For our team, every commit has an engineer's name attached to it, and that engineer ultimately needs to review and stand behind the code. We use steering rules to setup constraints for how the AI agent should operate within our codebase,

    This sounds a lot like Tesla's Fake Self Driving. It self drives right up to the crash, then the user is blamed.

    • groby_b 3 hours ago

      Except here it's made abundantly clear, up front, who has responsibility. There's no pretense that it's fully self driving. And the engineer has the power to modify every bit of that decision.

      Part of being a mature engineer is knowing when to use which tools, and accepting responsibility for your decisions.

      It's not that different from collaborating with a junior engineer. This one can just churn out a lot more code, and has occasional flashes of brilliance, and occasional flashes of inanity.

      • happyPersonR 2 hours ago

        Idk, it’s hard to say there’s no pretense when it’s literally called “Full Self Driving” and the CEO says as much.

      • Animats 2 hours ago

        > Except here it's made abundantly clear, up front, who has responsibility.

        By the people who are disclaiming it, yes.

  • gachaprize 4 hours ago

    Classic LLM article:

    1) Abstract data showing an increase in "productivity" ... CHECK

    2) Completely lacking in any information on what was built with that "productivity" ... CHECK

    Hilarious to read this on the backend of the most widely publicized AWS failure.

    • alfalfasprout 4 hours ago

      Yep. The problem is then leadership sees this and says "oh, we too can expect 10x productivity if everyone uses these tools. We'll force people to use them or else."

      And guess what happens? Reality doesn't match expectations and everyone ends up miserable.

      Good engineering orgs should have engineers deciding what tools are appropriate based on what they're trying to do.

  • cadamsdotcom 3 hours ago

    "We have real mock versions of all our dependencies!"

    Congratulations, you invented end-to-end testing.

    "We have yellow flags when the build breaks!"

    Congratulations! You invented backpressure.

    Every team has different needs and path dependencies, so each settles on a different interpretation of CI/CD and software engineering process. Productizing anything in this space is going to be an uphill battle to yank away teams' hard-earned processes.

    Productizing process is hard but it's been done before! When paired with a LOT of spruiking it can really progress the field. It's how we got the first CI/CD tools (e.g. https://en.wikipedia.org/wiki/CruiseControl) and testing libraries (e.g. pytest).

    So I wish you luck!

  • philipp-gayret 4 hours ago

    This is the first time I've seen "steering rules" mentioned. I do something similar with Claude; curious what it looks like for them and how they integrate it with Q/Kiro.

    • robjan an hour ago

      "steering rules" is a core feature baked into Kiro. It's similar to the spec files use in most agentic workflows but you can use exclusion and inclusion rules to avoid wasting context.

      There's currently no official workflow for managing these steering files across repos if you want organisation-wide standards, which is probably my main criticism.
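
      For context, a steering file is just markdown with YAML front matter that controls when it gets pulled into the agent's context. Going from memory of the Kiro docs here, so treat the exact keys as an assumption; it looks roughly like this (pattern and contents made up for illustration):

        ---
        inclusion: fileMatch
        fileMatchPattern: "src/api/**"
        ---
        # API conventions
        - Handlers return typed errors; never throw raw strings.
        - Every new endpoint needs an integration test against the mock backend.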

    • manmal 3 hours ago

      Those rules are often ignored by agents. Codex is known to adhere to them quite well, but it still falls back on its own ideas, which run counter to rules I’ve given it. The longer a session goes on, the more it goes off the rails.

      • philipp-gayret 3 hours ago

        I'm aware of the issues with rules given as a default prompt. I had hoped the author of the blog meant a different mechanism when they mentioned "steering rules". I do mean something different: an agent that self-corrects when it is seen going against rules in the initial prompt. I have a different setup myself for Claude Code, and would call parts of that "steering": adjusting the trajectory of the agent as it goes.

      • CharlesW 3 hours ago

        Everything related to LLMs is probabilistic, but those rules are also often followed well by agents.

    • CharlesW 3 hours ago

      I'd assume it's related to this Amazon "Socratic Human Feedback (SoHF): Expert Steering Strategies for LLM Code Generation" paper: https://assets.amazon.science/bf/d7/04e34cc14e11b03e798dfec5...

  • bcrosby95 2 hours ago

    It's amazing that their metrics exactly match the mythical "10x engineer" productivity boost.

  • whiterook6 3 hours ago

    This reads like "Hey, we're not vibe coding, but when we do, we're careful!" with hints of "AI coding changes the costs associated with writing code, designing features, and refactoring" sprinkled in to stand out.

  • r0x0r007 2 hours ago

    "For me, roughly 80% of the code I commit these days is written by the AI agent" Therefore, it is not commited by you, but by you in the name of AI agent and the holy slop. What to say, I hope that 100x productivity is worth it and you are making tons of money. If this stuff becomes mainstream, I suggest open source developers stop doing the grind part, stop writing and maintaining cool libraries and just leave all to the productivity guys, let's see how far they get. Maybe I've seen too many 1000x hacker news..

    • visarga 2 hours ago

      The feedback just needs to follow suit for this to be 100x as effective: tests, docs, and rapid loops of guidance with a human in the loop. Split your tasks; find the structure that works.

    • ChadNauseam 2 hours ago

      I think it's fine. For example, "I" made this library https://github.com/anchpop/weblocks . It might be more accurate to say that I directed AI to make it, because I didn't write a line of code myself. (And I looked at the code and it is truly terrible.) But I tested that it works, and it does, and it solves my problem perfectly. Yes, it is slop, but this is a leaf node in the abstraction graph, and no one needs to look at it again now that it is written.

      • shakna an hour ago

        Most code, though, is not write-once-and-ignore. So it does matter if it's crap, because every piece of software is only as good as its weakest dependency.

        Fine for just you. Not fine for others, not fine for business, not fine the moment your star count starts moving.

  • moron4hire 3 hours ago

    If you are producing real results at 10x then you should be able to show that you are a year ahead of schedule in 5 weeks (5 weeks at 10x is 50 weeks of single-speed work).

    Waiting to see anyone show even a month ahead of schedule after 6 months.

    • __MatrixMan__ 3 hours ago

      I've never worked anywhere that knew where they were going well enough that it was even possible to be a month ahead of schedule. By the time a month has elapsed the plan is entirely different.

      AI can't keep up because its context window is full of yesteryear's wrong ideas about what next month will look like.

    • ned_roberts 2 hours ago

      Looking at the “metrics” they shared, going from committing just about zero code over the last two years to more than zero in the past two months may be a 10x improvement. I haven’t seen any evidence more experienced developers see anywhere near that speedup.

    • journal 2 hours ago

      when copilot auto-completes 10 lines with 99% accuracy at a press of tab, does that not save you the time of typing those lines?

      • Macha 27 minutes ago

        Sure, but probably your pre-copilot IDE was autocompleting 7-8 of those lines anyway, just by playing type tetris, and typing the code out was never the slow part?

      • moron4hire 8 minutes ago

        Why would that save me a significant amount of time? Writing the code myself means I don't have to spend a bunch of time analyzing it to figure out what it does.

      • kellyjprice 2 hours ago

        Is typing 10 lines of code your bottleneck?

      • rescripting 2 hours ago

        What if I told you the hard part is not the typing.

  • exasperaited 4 hours ago

    Absolutely none of that article has ever even so much as brushed past the colloquial definition of "calculus".

    These guys actually seem rattled now.

    • photochemsyn 3 hours ago

      Well, 'calculus' is the kind of marketing word that sounds more impressive than 'arithmetic', and I think 'quantum logic' has gone a bit stale. 'AI-based' might also give more hope to the anxious investor class, since 'AI-assisted' is a bit weak: it means the core developer team isn't going to be cut from the labor costs on the balance sheet, they're just going to be 'assisted' (things like AI-written unit tests that still need some checking).

      "The Arithmetic of AI-Assisted Coding Looks Marginal" would be the more honest article title.

      • photonthug 2 hours ago

        Yes, unfortunately a phrase that's used in an attempt to lend gravitas and/or intimidate people. It sort of vaguely indicates "a complex process you wouldn't be interested in and couldn't possibly understand". At the same time it attempts to disarm any accusation of bias in advance by hinting at purely mechanistic procedures.

        Could be the other way around, but I think marketing-speak is taking cues here from legalese and especially the US Supreme Court, where it's frequently used by the justices. They love to talk about "ethical calculus" and the "calculus of stare decisis" as if they were following any rigorous process, or believed in precedent when it's not convenient. New translation from the original Latin: "we do what we want and do not intend to explain". Calculus, huh? Show your work and point to a real procedure, or STFU.

      • collingreen 3 hours ago

        "Galaxy-brain pair programming with the next superintelligence"

  • skinnymuch 4 hours ago

    Interesting enough to me though I only skimmed.

    I switched back to Rails for my side project a month ago, and AI coding has been great for not-too-complex stuff, while the old NextJS code base was in shambles.

    Before I was still doing a good chunk of the NextJS coding. I’m probably going to be directly coding less than 10% of the code base from here on out. I’m now spending time trying to automate things as much as possible, make my workflow better, and see what things can be coded without me in the loop. The stuff I’m talking about is basic CRUD and scraping/crawling.

    For serious coding, I’d think coding yourself and having ai as your pair programmer is still the way to go.

  • Madmallard an hour ago

    Lots of reasonable criticisms are being downvoted here. Are we being astroturfed? Is HN falling victim to the AI hype-train money too now?

  • brazukadev 3 hours ago

    But here's the critical part: the quality of what you are creating is way lower than you think, just like AI-written blog posts.

    • collingreen 3 hours ago

      Upvoted for a dig that is also an accurate and insightful metaphor.

  • Madmallard 5 hours ago

    first the Microsoft guy touting agents

    now AWS guy doing it !

    "My team is no different—we are producing code at 10x of typical high-velocity team. That's not hyperbole - we've actually collected and analyzed the metrics."

    Rofl

    "The Cost-Benefit Rebalance"

    In here he basically just talks about setting up mock dependencies and introducing intermittent failures into them. Mock dependencies have been around for decades, nothing new here.
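
    For the record, a fake dependency with injected intermittent failures is a few lines of bog-standard test code. A minimal Python sketch, all names made up:

      class FlakyPaymentService:
          """Fake dependency that simulates an intermittent upstream failure."""
          def __init__(self, fail_every=3):
              self.fail_every = fail_every
              self.calls = 0

          def charge(self, amount_cents):
              self.calls += 1
              if self.calls % self.fail_every == 0:
                  raise TimeoutError("simulated upstream timeout")
              return {"status": "ok", "charged": amount_cents}

      def checkout(service, amount_cents, retries=3):
          # Code under test: must tolerate transient dependency failures.
          for _ in range(retries):
              try:
                  return service.charge(amount_cents)
              except TimeoutError:
                  continue
          raise RuntimeError("payment failed after retries")

      def test_checkout_survives_intermittent_failures():
          # Deterministic: every third charge() call times out, so a
          # checkout needs at most two attempts to succeed.
          service = FlakyPaymentService(fail_every=3)
          for _ in range(100):
              assert checkout(service, 1000)["status"] == "ok"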

    It sounds like this test system you set up is as time-consuming as solving the actual problems you're trying to solve, so what time are you saving?

    "Driving Fast Requires Tighter Feedback Loop"

    Yes if you're code-vomiting with agents and your test infrastructure isn't rock solid things will fall apart fast, that's obvious. But setting up a rock solid test infrastructure for your system involves basically solving most of the hard problems in the first place. So again, what? What value are you gaining here?

    "The communication bottleneck"

    Amazon was doing this when I worked there 12 years ago. We all sat in the same room.

    "The gains are real - our team's 10x throughput increase isn't theoretical, it's measurable."

    Show the data and proof. Doubt.

    Yeah I don't know. This reads like complete nonsense honestly.

    Paraphrasing: "AI will give us huge gains, and we're already seeing it. But our pipelines and testing will need to be way stronger to withstand the massive increase in velocity!"

    Velocity to do what? What are you guys even doing?

    Amazon is firing 30,000 people by the way.

    • lispisok 3 hours ago

      We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast. From personal experience: I had a colleague use Claude Code to create a PR consisting of a dozen files and thousands of lines of code for something that could have been done in a couple hundred LOC in a single file.

      • CharlesW 3 hours ago

        > We're back to using LOC as a productivity metric because LLMs are best at cranking out thousands of LOC really fast.

        Can you point me to anyone who knows what they're talking about declaring that LOC is the best productivity metric for AI-assisted software development?

        • chipsrafferty 3 hours ago

          Are you implying that the author of this article doesn't know what they are talking about? Because they basically declared it in the article we just read.

          Can you point me to where the author of this article gives any proof to the claim of 10x increased productivity other than the screenshot of their git commits, which shows more squares in recent weeks? I know git commits could be net deleting code rather than adding code, but that's still using LOC, or number of commits as a proxy to it, as a metric.

          • CharlesW 3 hours ago

            > I know git commits could be net deleting code rather than adding code…

            Yes, I'm also reading that the author believes commit velocity is one reflection of the productivity increases they're seeing, but I assume they're not a moron and have access to many other signals they're not sharing with us. Probably stuff like: https://www.amazon.science/blog/measuring-the-effectiveness-...

      • yahoozoo 2 hours ago

        I had a coworker use Copilot to implement tab indexing through a Material UI DataGrid. The code was a few hundred lines. I showed them a way to do it in literally one line passed in the slot properties.

    • p1necone 4 hours ago

      "Our testing needs to be better to handle all this increased velocity" reads to me like a euphemistic way of saying "we've 10x'ed the amount of broken garbage we're producing".

    • blibble 3 hours ago

      if you've ever had a friend that you knew before, then they went to work at amazon, it's like watching someone get indoctrinated into a cult

      and this guy didn't survive there for a decade by challenging it

  • reenorap 3 hours ago

    No.

    The way to code going forward with AI is Test-Driven Development. The code itself no longer matters. You give the AI a set of requirements, i.e. tests that need to pass, and then let it code whatever way it needs to in order to fulfill those requirements. That's it. The new reality us programmers need to face is that code itself has an exact value of $0. That's because AI can generate it, and with every new iteration of the AI, the internal code will get better. What matters now are the prompts.

    I always thought TDD was garbage, but now with AI it's the only thing that makes sense. The code itself doesn't matter at all; the only thing that matters is the tests that prove to the AI that its code is good enough. It can be dogshit code, but if it passes all the tests, then it's "good enough". Then just wait a few months, rerun the code generation with a new version of the AI, and the code will be better. The humans don't need to know what the code actually is. If they find a bug, write a new test and force the AI to rewrite the code to pass the new test.

    I think TDD has really found its future now that AI coding is here to stay. Human-written code doesn't matter anymore; in fact, I would wager that hand-modifying AI-generated code is just as bad, a burden. We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.
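
    Concretely, the human's entire contribution becomes a suite like the one below, and whatever the AI puts behind the tested function is disposable. A minimal Python sketch; the module and function names are hypothetical:

      # The human writes and owns only this file. The implementation
      # behind parse_duration is AI-generated, never hand-edited, and
      # regenerated wholesale as long as this suite stays green.
      from solution import parse_duration  # hypothetical AI-owned module

      def test_plain_seconds():
          assert parse_duration("90s") == 90

      def test_minutes_and_seconds():
          assert parse_duration("2m30s") == 150

      def test_rejects_garbage():
          # Found a bug? Don't touch the code; add a test like this
          # and make the AI regenerate until everything passes.
          try:
              parse_duration("soon")
              assert False, "expected ValueError"
          except ValueError:
              pass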

    • sarchertech an hour ago

      > Then, just wait a few months and then rerun the code generation with a new version of the AI and the code will be better.

      How many times have you seen a code change that “passed all the tests” take down production or break an important customer’s workflow?

      Usually that was just a relatively small change.

      Now imagine that you regenerated literally all the code.

      The code is the spec. Any other spec comprehensive enough to cover all possible functionality has to be at least as complex as the code.

    • HellDunkel 3 hours ago

      No.

      The reason AI code generation works so well is that a) it is text-based, so the training data is huge, and b) the output is not the final result but a human-readable blueprint (source code), ready to be made fit by a human who can form an abstract idea of the whole in their head. The final product is the compiled machine code; we use compilers to do that, not LLMs.

      AI-generated code is not suitable to be transferred directly into the final product with only TDD as validation; it would simply be very inefficient to do so.

    • pcarolan 3 hours ago

      I mostly agree, but why stop at tests? Shouldn't it be spec-driven development? Then neither the code nor the language matters. Wouldn't user stories and requirements à la BDD (see Cucumber) be the right abstraction?

      • reenorap 3 hours ago

        I don't think you're wrong, but I feel like there's a big gap between the spec and the code. I think the tests are the part that will be able to give the AI enough context to "get it right" quicker.

        It's sort of like a director telling an AI the high level plot of a movie, vs giving an AI the actual storyboards. The storyboards will better capture the vision of the director vs just a high level plot description, in my opinion.

      • __MatrixMan__ 3 hours ago

        Maybe one day. I find myself doing plenty of course correction at the test level. Safely zooming out doesn't feel imminent.

      • gmd63 3 hours ago

        Why stop there? Whichever shareholders flood the datacenter with the most electrical signals get the most profits.

    • blibble 3 hours ago

      you will end up with something that passes all your tests then smashes into the back of the lorry the moment it sees anything unexpected

      writing comprehensive tests is harder than writing the code

      • reenorap 3 hours ago

        Then you write another test. That's the whole point of TDD. The more tests you write, the closer it gets to its final form.

        • blibble 3 hours ago

          right, and by the time I have 2^googolplex tests then the "AI" will finally be able to produce a correctly operating hello world

          oh no! another bug!

    • rvz an hour ago

      > We will need to make sure the test cases are accurate and describe what the AI needs to generate, but that's it.

      Yes. The first thing I always check in every project (and especially vibe-coded projects) is whether:

      A. Does it have tests?

      B. Is the coverage over 70%?

      C. Do the tests actually test for the behaviour of the code (good) or just its implementation (bad)? See the sketch below.

      If any of those requirements are missing, then that is a red flag for the project.
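
      To illustrate point C with hypothetical names: the first test below pins down what the function promises; the second pins down how it currently works, so any regeneration of the internals fails the suite even when the behaviour is still correct.

        from unittest.mock import patch
        from billing import apply_discount  # hypothetical module under test

        # Behaviour test (good): asserts on observable output.
        def test_discount_never_goes_below_zero():
            assert apply_discount(total=100, coupon="SAVE150") == 0

        # Implementation test (bad): asserts on internal wiring.
        def test_discount_uses_internal_helper():
            with patch("billing._lookup_coupon", return_value=150) as spy:
                apply_discount(total=100, coupon="SAVE150")
                spy.assert_called_once()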

      While TDD is absolutely valuable for clean code, focusing too much on it can be the death of a startup.

      As you said, the code itself is worth $0; the first product is still worth $10, and the finished product is worth $1M+ once it makes money, which is what matters.