I'm the author of this study. I spent the last year diving into ~7.3TB of data from 65,987 GitHub projects to see how well the laws of software evolution (proposed decades ago by Lehman) hold up at that scale.
The Findings: A Duality
Projects with more than roughly 700 commits on their main branch (16.1% of all projects) follow growth curves so stable that they could support claims of some evolutionary properties being divorced from human agency.
Despite all the changes in hardware, software, tooling, and methodology over the last few decades, and even Large Language Models through early 2025, the underlying growth trajectories of these mature systems haven't fundamentally shifted. This suggests that while our tools might make daily life easier, they might not change the fundamental physics of effort over time in large codebases.
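To make "stable growth curve" concrete: one simple way to quantify it (an illustrative sketch, not the study's actual methodology) is to fit a straight line to a project's cumulative commit counts over time and check the coefficient of determination. A near-linear trajectory scores close to 1.0; a decelerating one scores lower. The data below is synthetic.

```python
# Illustrative sketch: score growth-curve "stability" as the R-squared of
# a linear fit over cumulative commit counts. Synthetic data, not study data.

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def r_squared(xs, ys):
    """Coefficient of determination of the linear fit."""
    slope, intercept = linear_fit(xs, ys)
    my = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Synthetic trajectories: a steadily growing project vs. a decelerating one.
months = list(range(36))
steady = [120 * m + 5 * ((-1) ** m) for m in months]      # near-linear growth
stalling = [1000 * (1 - 0.9 ** m) for m in months]        # growth that tapers off

print(round(r_squared(months, steady), 4))    # very close to 1.0: stable
print(round(r_squared(months, stalling), 4))  # lower: deceleration
```

The real analysis involves far more than a single linear fit, but the intuition is the same: mature projects sit at the "boringly predictable" end of a metric like this.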
The Role of Smaller Projects
The smaller projects (83.9% of the dataset) not only follow less stable growth curves but are also more prone to deceleration. It's important to note that GitHub is—rightly—a home for everything from experimental prototypes and "homework" to niche tools. This experimentation is vital for the ecosystem, but it might also create challenges for the industry down the line.
Whether these projects follow suboptimal methods or were never intended to be sustainable long-term, their sheer numerical dominance might in itself create problems:
- Popularity vs. Quality: Training Large Language Models or building learning materials by scraping GitHub indiscriminately risks a "popularity" bias. We may learn suboptimal, immature methods simply because those patterns are numerically overwhelming compared to the more stable 16.1%.
- Feedback loop: When these learned patterns are then used to write new code, the overwhelming proportion of small projects might lead to "good enough, but not yet mature" practices being propagated, effectively drowning out the potentially better practices present in more mature projects.
- For researchers: Focusing solely on large projects can overlook a much larger and different set of projects that could benefit from a targeted study.
I’ll be around to answer any questions about the research.