The State of AI Coding Report 2025

(greptile.com)

66 points | by dakshgupta 7 hours ago

70 comments

zkmon 3 hours ago

I take these "code-output" metrics with a pinch of salt. Of course, a machine can generate 1000 times more lines of code, much as a power loom does. However, the comparison with the power loom ends there.

How maintainable is this code output? I saw an SPA HTML file produced by a model that looked almost like assembly code. So if the code can only be maintained by a model, then an appropriate metric should be based on the long-term maintainability achieved, not on instant generation of code.

a_imho 3 hours ago

My point today is that, if we wish to count lines of code, we should not regard them as "lines produced" but as "lines spent": the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.

As a dev I very much subscribe to this line of thought, but I also have to admit most of the business class people would disagree.

order-matters an hour ago

From a business perspective, the developer is the expert in lines of code and the assumption is that expertise should agree on the necessity of a line of code. To create lines of code that do not need to be there is akin to simply not doing your job in this perspective. The finished product should have X lines of code.

so from a business standpoint, if equivalent expertise amongst staff is assumed then productivity comes down to lines of code created. Just like how you might measure productivity of a warehouse employee by the number of items moved per hour. Of course if someone just throws things across the warehouse or moves things that don't need to be moved they will maximize this metric, but that would be doing the job wrong - which is not a productivity measurement problem. though admittedly the incentive structures and competition make these things often related.

the bigger issue to highlight, imo, is that the business side of things has no idea if coders are doing the job sufficiently well or not, and the lack of understanding is amplified by the reality that productivity contribution varies wildly per line, some requiring much more work to conjure than others. The person they need to rely on to validate this difference per instance is the same person who is responsible for creating the lines. So there is a catch-22 on the business side. An unproductive employee can claim productivity no matter what the measurement is.

if the variance of work required per line could be understood by the business side then it could be managed for. I used to manage productivity metrics for a medical coding company, and some charts are more dense and harder to code than others. I did not know how to code a medical chart but I could still manage productivity by charts per hour while still understanding this caveat.

the point isn't to use the productivity metric as a one-stop shop for promoting and firing people but as a filter for attention, where all the middle-of-the-pack stuff will more or less even out and not require too much direct attention. you then just need to get an understanding of how the average difficulty per item varies by product/project.

that said, maybe lines edited is still a step better - so that refactoring in a way that reduces the size of the codebase can still be seen as productive. 1 point for each line deleted and 1 point for each line added.
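
A minimal sketch of that "lines edited" score, assuming the input is git-style unified diff text. The function name `lines_edited_score` and the sample diff are made up for illustration; this is not a metric from the report.

```python
def lines_edited_score(unified_diff: str) -> int:
    """Score a git-style unified diff: 1 point per added line, 1 per deleted line."""
    score = 0
    in_hunk = False  # True once we are past the file headers of the current file
    for line in unified_diff.splitlines():
        if line.startswith("diff "):
            # A new file section starts; its "---"/"+++" headers are not edits.
            in_hunk = False
        elif line.startswith("@@"):
            # Hunk header: the lines that follow are context, additions, or deletions.
            in_hunk = True
        elif in_hunk and line[:1] in ("+", "-"):
            score += 1
    return score


# Hypothetical sample: two lines deleted, one line added.
SAMPLE_DIFF = """\
diff --git a/app.py b/app.py
--- a/app.py
+++ b/app.py
@@ -1,4 +1,3 @@
 import os
-import sys
-import re
+import json
 print("hi")
"""

print(lines_edited_score(SAMPLE_DIFF))  # 3
```

Under this scoring, a refactor that only deletes code still earns points, which is the property the comment is after.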

I understand that every line should be viewed as a liability, not an asset, but that's the job responsibility of the hired expert to figure out how many need to exist. it's not the job of the business side of things to manage.

I wouldn't tell my foundation guys how much concrete to use, or my electrician how much wire to use, but if one team can handle more concrete per hour than another and they are both qualified professionals, it really doesn't seem unreasonable to start off conversations with an assumption that one is more productive than the other. Lazy people do exist everywhere, it's usually a matter of magnitude of laziness between people more than it is a matter of actual full earnest capability.

Talanes an hour ago

"Just like how you might measure productivity of a warehouse employee by the number of items moved per hour. Of course if someone just throws things across the warehouse or moves things that dont need to be moved they will maximize this metric, but that would be doing the job wrong - which is not a productivity measurement problem."

I fail to see how having a measurement that clearly doesn't measure what is actually produced isn't exactly a productivity measurement problem. If your measurement is defeated by someone doing their job badly, what use is it?

kordlessagain 24 minutes ago

The argument is that some people can code with fewer lines of code, but if no lines are written at all, that’s an issue.

hvb2 3 hours ago

Agreed, I stopped reading at that point. You can't take yourself seriously if you create a report and use LOC as your measure.

I feel like we humans try to separate things and keep things short. We do this not because we think it's pretty; we do it so our human brains can still reason about a big system. As a result, LOC is a bad measure: being concise then hurts your productivity????

dakshgupta 3 hours ago

We're careful not to draw any conclusions from LoC. The fact is LoCs are higher, which by itself is interesting. This could be a good or bad thing depending on code quality, which itself varies wildly person-to-person and agent-to-agent.

mrdependable 3 hours ago

Can you expand on why it is interesting?

zed31726 3 hours ago

Because it's different. Change is important to track

dakshgupta 2 hours ago

How would you measure code quality? Would persistence be a good measure?

scuff3d 2 hours ago

That question has been baffling product managers, scrum masters, and C-suite assholes for decades. Along with how you measure engineering productivity.

epicureanideal 2 hours ago

Bad code can persist because nobody wants to touch it.

Unfortunately I’m not sure there are good metrics.

scuff3d 2 hours ago

It shouldn't be taken with a pinch of salt, it should be disregarded entirely. It's an utterly useless metric, and the fact that the report leads with it makes the entire thing suspect.

apercu 2 hours ago

When I was first learning Perl after being a shell scripter/sysadmin I produced a lot of code. 2-3 years later the same tasks would be way less code. So is more code good?

Also, my anecdotal experience is that LLM code is flat wrong sometimes. Like a significant percentage. I can't quote a number really, because I rarely do the same thing/similar thing twice. But it's a double digit percentage.

magicloop 2 hours ago

Your graphs roughly marry up with my anecdotal experience. After a while, when you know when and how to utilize LLMs/agents, coding does become more productive. There is a discernible improvement in productivity at the same quality level.

Also I notice it when the LLMs are offline. It feels a bit like when the internet connection fails. You remember the old days of lower productivity.

Of course, there are a lot of junk/silly ways to approach these tools, but all tools are just levers and need judgement/skill to be used well.

dakshgupta 7 hours ago

Hi, I'm Daksh, a co-founder of Greptile. We're an AI code review agent used by 2,000 companies from startups like PostHog, Brex, and Partiful, to F500s and F10s.

About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

We decided to compile some of the most interesting findings into a report. This is the first time we've done this, so any feedback would be great, especially around what analytics we should include next time.

ChrisbyMe 3 hours ago

Hey! Thanks for publishing this.

Would be interested in seeing the breakdown between uplift vs company size.

e.g. I work in a FAANG and have seen an uptick in the number of lines on PRs, partially due to AI coding tools and partially due to incentives for performance reviews.

dakshgupta 2 hours ago

This is a good one, wish we had included it. I'd run some analysis on this a while ago and it was pretty interesting.

An interesting subtrend is that Devin and other full async agents write the highest proportion of code at the largest companies. Ticket-to-PR hasn't worked nearly as well for startups as it has for the F500.

neom 3 hours ago

If AI tools are making teams 76% faster with 100% more bugs, one would presume you're not more productive, you're just punting more debt. I'm no expert on this stuff, but coupling it with some type of defect density insights might be helpful. Would also be interested to know what percentage of AI-assisted code is "rolled back" or "reverted" within 48 hours. Has there been any change in the number of review iterations over time?

apercu an hour ago

Right? I want to see the problem ticket variance year over year, with something to qualify the data if releases are more frequent.

jacekm 2 hours ago

> About a billion lines of code go through Greptile every month, and we're able to do a lot of interesting analysis on that data.

Which stats in the report come from such analysis? I see that most metrics are based on either data from your internal teams or publicly available stats from npm and PyPI.

Regardless of the source, it's still an interesting report, thank you for this!

dakshgupta 2 hours ago

Thanks! The first 4 charts as well as Chart 2.3 are all from our data!

wrs 3 hours ago

It’s hard to reach any conclusion from the quantitative code metrics in the first section, because as we all know, more code is not necessarily better. “Quantity” is not actually the same as “velocity”. And that gets to the most important question people have about AI assistance: does it help you maintain a codebase long term, or does it help you fly headlong into a ditch?

So, do you have any quality metrics to go with these?

dakshgupta 3 hours ago

We weren’t able to find a good quality measure. LLM-as-judge didn’t feel right. You’re correct that without that the data is interesting but not particularly insightful.

chis 2 hours ago

Wish you'd show data from past years too! It's hard to know if these are seasonal trends or random variance without that.

Super interesting report though.

wessorh 34 minutes ago

Clearly selling the report to business people who don't code. Like most things in the AI arena today, the report is BS about a system that mostly creates technical debt and is sold as intelligence.

locusofself 3 hours ago

This is definitely interesting information and I plan to take a deeper look at it.

What a lot of us must be wondering though is:

- how maintainable is the code being outputted

- how much is this newfound productivity saving (costing) on compute, given that we are definitely seeing more code

- how many livesite/security incidents will be caused by AI generated code that hasn't been reviewed properly

dakshgupta 3 hours ago

We weren’t able to agree on a good way to measure this. Curious - what’s your opinion on code churn as a metric? If code simply persists over some number of months, is that an indication it’s good quality code?

arcwhite 2 hours ago

I've seen code persist a long time because it is unmaintainable gloop that takes forever to understand and nobody is brave enough to rebuild it.

So no, I don't think persistence-through-time is a good metric. Probably better to look at cyclomatic complexity, and maybe for a given code path or module or class hierarchy, how many calls it makes within itself vs to things outside the hierarchy - some measure of how many files you need to jump between to understand it
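
For anyone who wants to try the cyclomatic complexity idea, here is a rough sketch using Python's standard `ast` module. It is a simplified McCabe-style count (1 plus the number of decision points), not a full implementation; the function name and the set of counted node types are my own choices, and it does not attempt the "calls inside vs. outside the hierarchy" measure.

```python
import ast

# Node types that each add one decision point in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity of Python source: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a or b or c" adds two extra paths: one per additional operand.
            complexity += len(node.values) - 1
    return complexity


# Hypothetical sample function to measure.
SAMPLE = '''
def classify(n):
    if n < 0 or n > 100:
        return "out of range"
    for d in (2, 3):
        if n % d == 0:
            return "divisible"
    return "other"
'''

print(cyclomatic_complexity(SAMPLE))  # 5
```

The sample scores 5: the base path, two `if` statements, one `for` loop, and one `or`.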

nerevarthelame 28 minutes ago

You run a company that does AI code review, and you've never devised any metrics to assess the quality of code?

dakshgupta 19 minutes ago

We have ways to approximate our impact on code quality, because we track:

- Change in number of revisions made between open and merge before vs. after greptile

- Percentage of greptile's PR comments that cause the developer to change the flagged lines

Assuming the author will only change their PR for the better, this tells us if we're impacting quality.

We haven't yet found a way to measure absolute quality, beyond that.
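
A sketch of how the second tracked number (share of review comments that lead to a change in the flagged lines) could be computed. Everything here is an assumption for illustration: the `ReviewComment` record, its fields, and `addressed_rate` are hypothetical and not Greptile's actual data model or method.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    # Hypothetical record: line numbers the comment flagged, and line numbers
    # the author changed in later commits on the same PR.
    flagged_lines: set[int]
    lines_changed_after: set[int]


def addressed_rate(comments: list[ReviewComment]) -> float:
    """Fraction of comments where at least one flagged line was later changed."""
    if not comments:
        return 0.0
    addressed = sum(1 for c in comments if c.flagged_lines & c.lines_changed_after)
    return addressed / len(comments)


# Made-up example data: two of the four comments led to a change.
comments = [
    ReviewComment({10, 11}, {11}),
    ReviewComment({5}, {7}),
    ReviewComment({1}, {1, 2}),
    ReviewComment({3}, set()),
]

print(addressed_rate(comments))  # 0.5
```

As the comment says, this only approximates impact: it counts that a flagged line changed, not whether the change was an improvement.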

wordpad 2 hours ago

I've seen code entropy suggested as a heuristic to measure.

TuringNYC 3 hours ago

Kudos to the designer, this site is beautiful.

a1ff00 3 hours ago

Was going to comment the same. Love the dot matrix paper look.

dionian 3 hours ago

agreed. was it AI?! not that i care - i've been doing a lot of tailwind apps with AI with great success. AI is great for the web, takes all the tedium out of it

simonw 3 hours ago

> Lines of code per developer grew from 4,450 to 7,839 as AI coding tools act as a force multiplier.

Is that a per-year number?

If a year has 200 working days that's still only about 40 lines of code a day.

When I'm in full-blown work mode with a decent coding agent (usually Claude Code) I'm genuinely producing 1,000+ lines of (good, tested, reviewed) code a day.

Maybe there is something to those absurd 10x multiplier claims after all!

(I still think there's plenty of work done by software engineers that isn't crunching out code, much of which isn't accelerated by AI assistance nearly as much. 40 lines of code per day felt about right for me a few years ago.)

observationist 3 hours ago

If you actually work, the amount of work you do is absurdly more than the amount of work most others do, and a lot of the time, both the high and low productivity people assume everyone just does as much as they do, in both directions.

A lot of people are oblivious to Zipf distributions in effort and output, and if you ever catch on to it as a productive person, it really reframes ideas about fairness and policy and good or bad management.

It also means that you can recognize a good team, and when a bunch of high performers are pushing and supporting each other and being held to account out in the open, amazing things happen that just make other workplaces look ridiculous.

My hope for AI is that instead of 20% of the humans doing 80% of the work, you end up with force multipliers, and a ramping up, so that more workplaces look like high-functioning teams, making everything more fair and engaging and productive. But I suspect that once people get better with AI, at least up to the point of AGI, we're going to see the same distribution but at 10x or 50x the productivity.

Garlef an hour ago

My experience with coding agents points in the other direction: It's mentally very taxing!

Usually, you have a lot of time to think on the side while coding on what to do next, strategize, etc. But if you work in small increments with an LLM agent, this time is reduced and you have to be ready for the next thing once one increment is done.

So I don't see this as an equalizer. Rather, those who can constantly push forward are getting much more than those who can't.

lumost 3 hours ago

There is a long tail of engineers working on mature/stable code bases where there are fewer extremely large diffs, or the review burden is extremely high. If you work on core software - then you can never say that a line of code was wrong "because of the AI." e.g. places where you might need 2-3x code approvers or more.

rnewme 3 hours ago

1k loc per day or 1k git additions? I don't think one person can consistently review 1k loc, grow a codebase at that speed and size, and classify it as good, tested and reviewed. Can you tell us more about your process?

simonw 3 hours ago

I'm effectively no longer typing code by hand: I decide what change I want to make and then prompt Claude Code to describe that change. Sometimes I'll have it figure out the fix too.

An example from earlier today: https://github.com/simonw/llm-gemini/commit/fa6d147f5cff9ea9...

That commit added 33 lines and removed 13 - so I'm already at a 20-lines-a-day level just from that one commit (and I shipped a few more plus a release of llm-gemini: https://github.com/simonw/llm-gemini/commits/a2bdec13e03ca8a...)

It took about 3.5 minutes. I started from this issue someone had filed against my repo:

Then I opened Claude Code and said:

  Run this command: uv run llm -m gemma-3-27b-it hi

That ran the command and returned the error message. I then said:

  Yes, fix that - the gemma models do not support media resolution

Which was enough for it to figure out the fix and run the tests to confirm it hadn't broken anything.

I ran "git diff", thought about the change it had made for a moment, then committed and pushed it.

Here's the full Claude Code transcript: https://gistpreview.github.io/?62d090551ff26676dfbe54d8eebbc...

I verified the fix myself by running:

  uv run llm -m gemma-3-27b-it hi

I pasted the result into an issue comment to prove to myself (and anyone else who cares) that I had manually verified the fix: https://github.com/simonw/llm-gemini/issues/116#issuecomment...

Here's a more detailed version of the transcript including timestamps, showing my first prompt at 10:01:13am and the final response at 10:04:55am. https://tools.simonwillison.net/claude-code-timeline?url=htt...

I built that claude-code-timeline application this morning too, and that thing is 2284 lines of code: https://github.com/simonw/tools/commits/main/claude-code-tim... - but that was much more of a vibe-coded thing, I hardly reviewed the code that was written at all and shipped it as soon as it appeared to work correctly. Since it's a standalone HTML file there's not too much that can go wrong if it has bugs in it.

WhyOhWhyQ 3 hours ago

Whenever I start reviewing code produced by Claude I find hundreds of ways to improve it.

I don't know if code quality really matters to most people or to the bottom line, but a good software engineer writes better code than Claude. It is a testament to library maintainers that Claude is able to code at all, in my opinion. One reason is that Claude uses APIs in wacky ways. For instance, by reading the SDL2 documentation I was able to find many ways that Claude writes SDL2 using archaic patterns from the SDL days.

I think there are a lot of hidden ways AI booster types benefit from basic software engineering practices that they actively promote damaging ideas about. Maybe it will only be 10 years from now that we learn that having good engineers is actually important.

simonw 2 hours ago

> Whenever I start reviewing code produced by Claude I find hundreds of ways to improve it.

Same here. So I tell it what improvements I want to make and watch it make them.

I've gained enough experience at prompting it that it genuinely is faster for me to tell it the change I want to make than it is for me to make that change myself, 90% of the time.

HDThoreaun an hour ago

Ok then you just make review comments and it fixes them. Still faster than writing code yourself

WhyOhWhyQ 6 minutes ago

You missed the point. The original post is about not reading code.

You actually missed the point in two ways, because my response had little to do with speed of producing code.

leothetechguy 3 hours ago

I couldn't in good conscience work like that, I believe the risk of bad AI generated code due to the tiniest of output variation is far too high. Especially in systems that need to maintain a large state governed by many rules and edge cases.

noosphr 3 hours ago

I'm a good aerospace engineer, my rockets weigh an extra 50kg after every day I work on them.

WhyOhWhyQ 3 hours ago

You're writing Python and Javascript right? Those languages are extremely easy to write in (which conversely means the legibility is likely to be poor). People maintaining legacy systems in systems level languages aren't going to be able to produce as much code as people writing Python and Javascript.

simonw 2 hours ago

Yes, mostly Python and JavaScript and SQL. I'm dabbling a little more with Go these days too.

CrzyLngPwd 3 hours ago

1,000 lines of debt that you didn't review and probably have no idea what they do.

AlexandrB 3 hours ago

Yeah, I don't get it. It's well known that "LOC" is not a good metric of developer productivity. But now that AI is writing those lines of code, it's fine as a metric?

noosphr 3 hours ago

Senior developers know that every line of code is debt. Junior developers think that every line of code is wealth.

dakshgupta 3 hours ago

This is per month, I see now that's not super clear on the chart!

cmdtab 3 hours ago

I saw your example and it was a simple cli tool. Of course you can have claude make commits effectively to it!

simonw 3 hours ago

Totally. I have dozens of "simple CLI tools" that I work on - and small plugins, and HTML+JavaScript utilities.

If I was hacking on the Linux kernel I would be delighted with myself for producing 40 lines of landed code in a single day.

eikenberry 2 hours ago

They are obviously talking about writing code against expectations greater than these simple tools. Why troll with the hyperbole?

waterproof 3 hours ago

Looks like it's a monthly number.

vb-8448 2 hours ago

In the engineering team velocity section, the most important metric is missing: the change rate of new code, or how many times it is changed before being fully consolidated.

dakshgupta 2 hours ago

This is a great suggestion. I'll note it down for next year's report. Curious, do you think this would be a good proxy for code quality?

all2 2 hours ago

I would consider feature complete with robust testing to be a great proxy for code quality. Specifically, that if a chunk of code is feature complete and well tested and now changing slowly, it means -- as far as I can tell -- that the abstractions contained are at least ok at modeling the problem domain.

I would expect code that continually changes and deprecates and creates new features is still looking for a good problem domain fit.

dakshgupta 2 hours ago

Most of our customers are enterprises, so I feel relatively comfortable assuming they have some decent testing and QA in place. Perhaps I am too optimistic?

vb-8448 2 hours ago

It's tricky, but one can assume that code written once and not touched in a while is good code (didn't cause any issues, performance is good enough, etc.).

I guess you can already derive this value if you sum the total lines changed by all PRs and divide it by (SLOC end - SLOC start). Ideally it should be a value slightly greater than 1.
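
The ratio described above, as a small sketch. The function name `churn_ratio` and the sample numbers are invented for illustration; it assumes the codebase grew over the period, since the ratio is undefined or misleading otherwise.

```python
def churn_ratio(pr_lines_changed: list[int], sloc_start: int, sloc_end: int) -> float:
    """Total lines changed across all PRs divided by net SLOC growth over the period."""
    net_growth = sloc_end - sloc_start
    if net_growth <= 0:
        # Assumption: the ratio is only meaningful when the codebase grew.
        raise ValueError("net SLOC growth must be positive for this ratio")
    return sum(pr_lines_changed) / net_growth


# Made-up example: three PRs changed 1,000 lines in total,
# while the codebase grew from 10,000 to 10,800 SLOC.
print(churn_ratio([400, 250, 350], 10_000, 10_800))  # 1.25
```

A value of 1.25 here would mean about a quarter more lines were touched than ended up as net new code, i.e. some rework on the way to the final state.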

sillyfluke 2 hours ago

It depends on how well you vetted your samples.

fyi: You headline with "cross-industry", lead with fancy engineering productivity graphics, then caption it with small print saying it's from your internal team data. Unless I'm completely missing something, it comes off as a little misleading and disingenuous. Maybe intro with what your company does and your data collection approach.

dakshgupta 2 hours ago

Apologies, that is poor wording on our part. It's internal data from engineers that use Greptile, who number in the tens of thousands and come from a variety of industries. That's as opposed to external, public data, which is where some of the charts are from.

superchris 2 hours ago

This thing that can't be measured is up 76%. Eyeroll

nekooooo 3 hours ago

i'm a designer and even i know not to measure 'lines of code' as meaningful output or impact. are we really doing this?

dakshgupta 3 hours ago

We expressly did not conclude that more lines = better. You could easily argue more lines = worse. All we wanted to show is that there are more lines.

poliphili 2 hours ago

Language like "productivity gains", "output" and "force multiplier" isn't neutral like you're claiming here, and does imply that the line count metric indicates value being delivered for the business.

psunavy03 3 hours ago

Sigh . . . once again I see "velocity" as something to be increased.

This makes me metaphorically stabby.

dakshgupta 3 hours ago

We were trying not to insinuate that, because we don’t have a good way to measure quality, without which velocity is useless.