The testing pyramid is an outdated economic model

(wiremock.io)

47 points | by hackeruse 2 days ago

41 comments

  • orwin 3 hours ago

    I'm pretty sure you ought to make the tests you think are needed, and the 'form' is not a good metric.

    Our libraries with a lot of logic and calculations are dominated by unit tests; our libraries that talk to external APIs are dominated by integration tests. That's just good testing. I'm not sure you need to imagine a pyramid or a vase to decide which tests to write.

    • diggan 3 hours ago

      People are addicted to metrics, so when "code coverage" becomes something to measure, people tend to go crazy with the testing, even testing trivial stuff that doesn't really need tests.

      My personal rule of thumb is something like: if testing makes you go slower, you're doing too little or too much of it; if it makes you go faster, you're doing the right amount.

      If you find yourself having to rewrite 10% of the test suite every time the code changes, you're probably doing too much testing (or not treating your test code as production code). If you find yourself breaking stuff all over the place when doing other things, you're doing too little testing.

      As with most things, it's a balance; going too far in either direction will hurt.

      • OJFord 2 hours ago

        That probably works well when you've already established a good baseline, but when tests barely exist (or greenfield) or they're crap, I find it can be a real slowdown to try to test something you know you should.

        Especially if there's something not entirely straightforward about it, like you need to figure out a way to instrument/harness something for the first time so that you can actually test against it. (Arguably inherently doesn't happen with unit tests though I guess.)

        • diggan 2 hours ago

          > like you need to figure out a way to instrument/harness something for the first time so that you can actually test against it

          Indeed, and sometimes it's not worth figuring out, but sometimes it is. For example, if you have a piece of code that hasn't changed since the beginning, and there have been no bugs stemming from that code, adding tests to it is kind of futile unless you already know it has to be refactored soon.

          On the other hand, if you join an existing project with almost no tests, and for the last N months that code has been responsible for introducing lots of bugs, then it's probably worth it to spend time building some testing infrastructure around the project, with that part first in line for heavy testing.

          Basically, if the calculation shows that you can save more time by not having those bugs, you have some budget to spend on making the tests work well for that part.

        • anymouse123456 2 hours ago

          > I find it can be a real slowdown to...

          Not OP, but this is the vibes approach I often use (and believe they're advocating for).

          If it feels painful to add a new test, it's likely time (or nearly time) to make adding tests (or at least that one test) easier.

          I've found that once it's easy to add tests, new tests often start appearing, so that first effort can be some of the most high-impact work available to me.

    • cogman10 3 hours ago

      Completely agree.

      I think the problem is that some devs want hard-and-fast rules, when a lot of the time the right answer is "it depends" and experience dictates the action.

      • anymouse123456 2 hours ago

        The thing that frustrated me as a new player was that there often seemed to be two opposing poles and neither was very helpful.

        There is the dogmatic rules crowd (triggers my self-diagnosed oppositional defiant disorder), and the "it depends" crowd, which left me screaming, "ON WHAT?!"

        When I was in this position, I found Kent Beck and Martin Fowler's notion of "Code Smells" [0] really helpful. Though admittedly, the comprehensive enumeration with associated Refactorings was probably a bridge too far.

        "Code Smells" lean toward the "it depends" vibe, but with just enough structure to aid in decision making. It also bypasses my inflexible opposition to stupid rules in stupid places.

        I try to frame too much or too little testing as a Code Smell and discussing it that way often (not always) leads to reasonably easy consensus related to what we should (or shouldn't) do about it.

        [0] https://martinfowler.com/bliki/CodeSmell.html

        • cogman10 an hour ago

          > which left me screaming, "ON WHAT?!"

          The "on what" is context- and situation-dependent. Heck, there's even an aspect of personal preference in there.

          From the perspective of a code smell, it's very similar to real-life smells. Garlic is an awesome ingredient, and in the right context, like a good Italian dish or pizza, it's the thing that makes you go "mmm". However, if you are making oatmeal or a dessert, garlic is probably the last smell you want.

          Code smells are much the same way and, like garlic, people will disagree on what is a code smell and good coding practice. While there are some smells that are somewhat universally despised (such as raw sewage) there are others that can be arguably good or bad based on context or preference.

          To pull this out further, a common code smell is "long methods". Yet, if you are writing something like a parser or a CPU emulator, then a giant switch statement is really going to be the right thing to inject at some point even though it might create a 1000 line monstrosity method.

          Martin acknowledges this in his essay.

          > ... smells don't always indicate a problem. Some long methods are just fine. You have to look deeper to see if there is an underlying problem there - smells aren't inherently bad on their own - they are often an indicator of a problem rather than the problem themselves.

          This is where I think the correct position to take is just going to be "it depends", with an understanding that even if something isn't your preference it might be someone else's. Today's best practice has a nasty tendency to turn into tomorrow's code smell. Being aware of that fact will help you not jump too quickly to conclusions about the state of a code base or the competence of its devs. You might even learn something cool about what you can do by breaking the rules/smells/dogma.

          I know it can be frustrating, but really a lot of this just comes with experience and humility to know you and everyone else won't always be right about everything. There's no high priest of good code.

    • blueflow 2 hours ago

      Yeah, it's the "model" that is outdated. Not the testing itself.

  • creesch 5 hours ago

    Okay? This seems like a fluff blog post, as the trophy concept was coined back in 2018 if I am correct. Coming from wiremock it makes sense given their product, but it is just marketing fluff.

    Honestly, as long as the GUI tip remains as small as possible I am mostly fine with whatever shape it takes below there. For modern web applications with a lot of APIs it does make sense to use a trophy. For other applications without such a communication layer a more traditional pyramid does make more sense.

    What a lot of people often seem to completely overlook in discussions like this is that the pyramid isn't a goal in itself. It is intended as a way to think about where you place your tests. More specifically place tests where they make sense, provide most value and are least fragile.

    Which is why the GUI should be avoided for any tests that test logic, hence it being the smallest section of whatever shape you come up with. Everything else highly depends on what sort of infrastructure you are dealing with, the scope of your application, etc.

    • hitchstory 4 hours ago

      I think the premise is correct and I think you are disagreeing with it.

      Yes, the pyramid was set out as a goal in its original incarnation. That was deeply wrong. The shape ought to be emergent and determined by the nature of the app being tested (I went into detail on what should determine that here https://news.ycombinator.com/item?id=42709404)

      Some of the most useful tests I've worked with HAVE had a large GUI tip. The GUI behavior was the most stable surface whose behavior was clearly defined and which everybody agreed upon. All the code got tested. GUI tests provided the greatest freedom to refactor, covered the most bugs, and provided the most value by far on that project.

      GUI tests are not inherently fragile or inherently too slow either. This is just a tendency that is highly context specific, and as the "pyramid" demonstrates - if you build a rule out of a tendency that is context specific it's going to be a shit rule.

  • VMG 5 hours ago

    > The pyramid is also an artifact of the era in which it was created.

    > Computers were slower, testing and debugging tools were rudimentary, and developer infrastructure was bloated, cumbersome, and inefficient.

    What AMD giveth, Electron taketh away.

    No matter how fast computers get, developers will figure out a way to use that extra compute to make the build and the test cycle slower again.

    Of course it is all relative - it is hard to define what a "unit" test is when you are building on top of enormous abstractions to begin with.

    No matter what you call the test, it should be fast. I feel productive when I can iterate on a piece of code with 2 second feedback time.

    • giorgioz 4 hours ago

      > What AMD giveth, Electron taketh away.

      This is actually true, but the moralistic negative tone with no explanation makes me think the writer did not understand why this is happening and why it has both pros and cons. It's similar to another statement I have heard on this subject: "It's pointless to add/widen roads, there will always be traffic." It's true there will always be traffic, but it's not pointless. There will always be traffic because moving more cars becomes faster, so more people do it. Consider, though, that traffic on a single lane helped 100 people, while traffic on a two-lane street helped many more.

      The same is true for software development. Computers get faster, but programs tend to stay at the same perceived speed, just as roads widen and yet there is still the same amount of traffic. When computers get faster, developers can write code faster, and so they can write more code and/or cheaper code. Writing programs also becomes cheaper, so developers need to be less expert and trained. The computer that brought astronauts to the moon was probably less powerful than today's smart thermostat. Yet landing on the moon with that computer required a team of people likely at PhD level, intensely focused and dedicated, all socially and culturally adjacent to the inventors of the computer. By comparison, today's programs do trivial things using immense resources. And yet, because many more developers can code, there are also immensely more programs, covering millions of use cases, developed all over the world, in some cases by people who do not even speak English.

      So programs did become less efficient because the true bottleneck was not the efficiency of the program. The true bottleneck was developer hours and skills.

      This doesn't mean that it's okay for all programs to be slow, or that you should be satisfied using programs you perceive as slow. The correlation between the speed/efficiency of a program and its UX is a bell curve. At the beginning, the faster it gets, the better the UX. After a certain speed, though, the UX improves only marginally. If the final user cannot distinguish between the speeds of two different programs, the bottleneck is no longer speed, and another characteristic becomes the bottleneck. That said, there will always be work for efficiency engineers and low-level developers writing more performant code. But not all code needs to be written as efficiently as possible.

      • VMG 2 hours ago

        I didn't intend it as a negative tone. It is meant as an observation: while raw system speed has increased by orders of magnitude, certain high-level operations seem to remain constant speed over time.

        The time it takes to boot an operating system, start a program, compile a program, or run a test suite seems to have remained somewhat constant over my career.

        It indicates that the determining factor is not the clock speed of the underlying system but instead the pain tolerance of the users or developers.

    • mkoubaa 5 hours ago

      Software has gotten so slow we've forgotten how fast computers are

      • Ygg2 2 hours ago

        Also, hardware is rapidly approaching the saturation point of Moore's law. Software will have to adapt.

  • bluGill an hour ago

    I have long drawn the pyramid with runtime as the vertical axis. At the bottom are tests that run so fast you would run them on every build even if there is no possible way your code changes could break them. Then tests that you run if they cover what you just changed. Next, tests you run on all CI builds [locally only when they fail]. Then tests you run regularly but not on every CI build (every night or every week). Finally, tests that are so expensive you run them rarely - these are generally manual tests where a human exercises the system in some real-world condition that is hard to set up.

    There is value in every level.
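
    One way to sketch this runtime-axis idea is a tiering scheme where each trigger runs every tier it can afford. The tier names and budgets below are made up for illustration, not from any real build system:

```python
# A sketch of a runtime-axis pyramid: tests are grouped by how expensive
# they are, and each trigger runs every tier at or below its budget.
# Tier names and budgets are illustrative assumptions.
TIERS = ["instant", "changed-code", "ci", "nightly", "manual"]

# How many tiers each trigger is willing to pay for.
BUDGET = {
    "every-build": 1,   # only the instant tests
    "local-dev": 2,     # instant + tests covering what you changed
    "ci": 3,            # everything cheap enough to gate a merge
    "nightly": 4,       # long-running suites
    "release": 5,       # includes manual, real-world checks
}

def tiers_to_run(trigger: str) -> list[str]:
    """Return the test tiers that a given trigger should execute."""
    return TIERS[:BUDGET[trigger]]
```

    The point of the shape is that cheap tiers run constantly while expensive ones still run eventually, so, as above, there is value in every level.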

  • mattgreenrocks 2 hours ago

    IMO, it's less about the type of test and more about your ability to get in and test as many code paths as you can of features that your users perceive as critical.

    Sometimes that requires E2E tests, sometimes that's integration or unit tests.

    My preference is to use something like functional core/imperative shell as much as possible, but the more external dependencies you have, the more work you have to do to create an isolated environment free of IO. Not saying it isn't worthwhile, but sometimes it's easier to just accept that the tests will be slower due to relying on real endpoints and move on. After all, tests should support velocity, not be an end in and of themselves.
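
    A minimal functional core/imperative shell sketch (all names hypothetical): the core is a pure function that unit tests can call directly, while the shell owns every bit of IO:

```python
import json

def summarize_orders(orders: list[dict]) -> dict:
    """Functional core: pure computation over already-loaded data."""
    total = sum(o["amount"] for o in orders)
    return {"count": len(orders), "total": total}

def report(path: str) -> str:
    """Imperative shell: all the IO lives here, none of the logic."""
    with open(path) as f:
        orders = json.load(f)
    return json.dumps(summarize_orders(orders))
```

    Tests for the core need no files or mocks at all; only the thin shell ever touches the filesystem.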

  • teeray 2 hours ago

    > 1) It’s now possible to run a wide range of tests on an application very quickly through its public interface, enabling a broader scope of testing without excessive time or resource constraints.

    > 2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.

    > 3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.

    Citation needed on all of these. Where are the specific tools that make running all of these magically fast and reliable (read: not flakey) integration tests possible?

  • weinzierl 5 hours ago

    "The pyramid is also an artifact of the era in which it was created. Computers were slower, testing and debugging tools were rudimentary, and developer infrastructure was bloated, cumbersome, and inefficient."

    In addition to that, I think a major point is that the testing pyramid was conceived in a world where desktop GUI applications ruled. Testing a desktop GUI was incredibly expensive and automation extremely fragile. That is in my opinion where the pointy tip of the pyramid came from in the first place.

    "But the majority of tests are of the entire service, via its API [..]"

    I think this is where you get the best bang for your buck because your goal to keep your tests robust is well aligned with the goal to keep the API stable. This is not the case above and below, where the goal of robust tests is always at odds with change, quick adaption and rapid iteration.

    • dtech 4 hours ago

      The pointy shape still holds, because we often have multiple services now, and testing across services is difficult and expensive.

  • mrkeen 4 hours ago

    We don't need this article, because the 90%-unit/10%-itest split was only ever a goal to aspire to. Just like achieving 90% code coverage - no need for a thinkpiece to say that 40% or 60% is now 'the right amount' of code coverage.

    We like units because they are fast, deterministic, parallelisable... all the good stuff. Relative to that ideal, integrations are slower, flakier, more sequential, etc.

    While I've never gone full-TDD, those guys have it absolutely right that testability is a design/coding activity, not a testing activity. TDD will tell you if you're writing unit-testable or not, but it won't tell you how. Dependency-inversion / Ports-and-adapters / Hexagonal-architecture are the topics to read on how to write testable code.

    What's my personal stake in this? Firstly, our bugfix-to-prod-release window is about four hours. Way too long. Secondly, as someone relatively new to this codebase, when I stumble across some suspicious logic, I can't just spit out a new unit test to see what it does, since it's so intermingled with MS-SQL and partner integrations. Our methods pass around handles to the DB like candy.

    So what I think has happened here, is that we generally don't think about writing testable code as an industry. Therefore our code is all integrations, and no units. So when we go to test it, of course the classic testing pyramid is unachievable.
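
    A rough sketch of the dependency-inversion / ports-and-adapters idea the comment points at (all names hypothetical): the domain logic depends on a small port rather than on the database, so a unit test can supply an in-memory fake instead of passing DB handles around like candy:

```python
from typing import Protocol

class OrderPort(Protocol):
    """The port: the only surface the domain logic is allowed to see."""
    def amounts_for(self, customer_id: str) -> list[float]: ...

def customer_total(repo: OrderPort, customer_id: str) -> float:
    """Domain logic: unit-testable against any adapter."""
    return sum(repo.amounts_for(customer_id))

class InMemoryOrders:
    """Fake adapter for unit tests; a real adapter would wrap the DB."""
    def __init__(self, data: dict[str, list[float]]):
        self._data = data

    def amounts_for(self, customer_id: str) -> list[float]:
        return self._data.get(customer_id, [])
```

    With this shape, spitting out a new unit test for suspicious logic takes seconds, because nothing in the test touches MS-SQL or a partner integration.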

    • bluGill an hour ago

      What is a unit?

      I have never seen a formal definition. Without that we cannot have any discussion.

      To some a unit is a function. To some it is a module (generally someone else's module). To some it is an entire application. To some it is the entire computer in your embedded device. To some it is the entire device... Most people have no clue what we are talking about and don't care (and also should not care).

      • mrkeen 35 minutes ago

        My working definition is that a unit is something which can be unit-tested. By that I mean: the code being tested has the same properties that you expect of a good unit-test. I guess fast and deterministic are good enough properties for a working definition.

        writeFile(fname,"Hello, World") is only one thing, but its behavior will depend on the state of the filesystem, so it's not a unit.

        parseComplicatedObject(bytes) could be a unit (even if it calls out to many other sub-parsers - as long as they are also units).

        One thing I see in a lot of companies is efforts to reduce test flakiness. Devs will attempt this work in src/test. But if the code is flaky, and you change the test from flaky to reliable, then you have just decreased the realism of your test. Like I mentioned with my comment above, you reduce test flakiness by doing the hard work in src/main. The src/test changes should be easy after that.
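
        A toy illustration of that working definition (an invented example, not from any real codebase): this parser is a unit because it and its sub-parser are fast and deterministic, touching no clock, filesystem, or network:

```python
def parse_pair(line: str) -> tuple[str, str]:
    """Sub-parser for a single 'key=value' line. Also a unit."""
    key, _, value = line.partition("=")
    return key.strip(), value.strip()

def parse_config(text: str) -> dict[str, str]:
    """A unit composed only of other units: fast and deterministic."""
    return dict(parse_pair(l) for l in text.splitlines() if "=" in l)
```

        A writeFile-style function fails the same test: its result depends on filesystem state, so no amount of work in src/test can make it a reliable unit.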

  • tobyhinloopen 3 hours ago

    It kinda depends on your architecture. If you can run integration tests for cheap, it makes sense to favor them over smaller unit tests.

    I like to design my applications so all slow components can be mocked by faster alternatives, and have the HTTP stack as thin as possible so I can basically call a function and assert the output, while the output closely resembles the final HTTP response, either rendering a template with a blob of data, or rendering the blob of data as JSON.
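
    A rough sketch of that thin-HTTP-stack shape (hypothetical handler and helper names): the handler is a plain function returning the response blob, so tests call it directly and assert on the output, while the HTTP layer only serializes:

```python
import json

def get_user_handler(user_id: int, db: dict) -> dict:
    """All the logic lives here; tests call this function directly."""
    user = db.get(user_id)
    if user is None:
        return {"status": 404, "body": {"error": "not found"}}
    return {"status": 200, "body": user}

def to_http_json(resp: dict) -> tuple[int, str]:
    """The thin HTTP shell: a status code plus a JSON body."""
    return resp["status"], json.dumps(resp["body"])
```

    The blob the test asserts on closely resembles the final HTTP response, which is the whole point.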

  • msoad 3 hours ago

    Here is a conundrum:

    With all the AI-generated code being pushed, as a leader I wonder which is better: enforce a ton of e2e tests so that no code that hasn't really thought through all aspects of the solution can get past CI, or does that just enable the AI to go even crazier and break all sorts of best practices merely to pass the tests?

    • __MatrixMan__ an hour ago

      The prompt, the tests, the surrounding code, and the compiler and linters form a sort of box to constrain the AI. It works best if that box is kept small.

      AI capability drops sharply once the context gets too big. Iterating with an AI against an E2E means involving enough context that you're likely to run into problems with the AI's capability, but even if not it means that there's a lot more space for creativity before you get the signal that you've gone too far.

      It's too easy to forget that you've omitted a crucial file from the context and instead be iterating on increasingly desperate prompts--it's the kind of mistake you want to catch and correct early, so again: small boxes.

      For these reasons, I think lots of E2Es is the wrong play, because it creates big boxes.

      If I were leading a team of AI-using devs I'd be looking for ways to create higher fidelity constraints which can then form part of that box, or which interrupt the cases where we get lazy and let it be unconstrained by any requirement except that nobody has screamed about it yet.

      This would be stuff like having teams communicate their needs to one another by creating ignored failing tests in the other team's repo, such that they can be un-ignored once they pass. Or ensuring that the designs aren't just user-focused but include the kinds of things that end up getting added directly to the context without being re-interpreted by the dev (e.g. files defining interfaces, or terse behavioral descriptions), such that devs on different teams are including the same design artifacts in the context while they build adjacent components.

      It's like AI generated code is a gas that will fill the available space, so it's the boundaries that require human focus. For this reason I disagree with the article. E2Es and ITs are too slow/expensive to run often enough to be useful constraints for AI. Small tests are way better when you can get away with them.

  • ak681443 3 hours ago

    Tests are a way to write your logic twice (once for the code and once during the assertions) with the assumption that you're unlikely to make the same mistake twice.

    Integration tests are better replaced by something like contract testing IMO to still retain the test parallelism.
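
    A hand-rolled toy version of the contract-testing idea (a sketch, not a real contract-testing framework): the consumer pins the fields and types it relies on, and the producer's response is checked against that pin, so each side can be tested in parallel without spinning up the other:

```python
# The consumer's pin: the fields and types it actually relies on.
# Field names here are hypothetical.
CONSUMER_CONTRACT = {"id": int, "email": str}

def satisfies_contract(response: dict, contract: dict) -> bool:
    """True if every contracted field is present with the right type.
    Extra fields in the response are fine; missing ones are not."""
    return all(
        field in response and isinstance(response[field], expected)
        for field, expected in contract.items()
    )
```

    The producer verifies its real responses against the pin in its own suite; the consumer tests against a stub known to satisfy the same pin.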

  • sohnee 5 hours ago

    I think we all agree that the majority of tests should be fast, automated, and robust. When the pyramid was written, that probably _did_ mean unit tests. In 2025 it doesn't.

    Let's keep the pyramid but rename the segments!

    • __MatrixMan__ an hour ago

      I propose the names: small, mid, and large. More nuanced lingo around tests is typically about sounding smart, when really what you should be doing is not classifying but analyzing your particular case, which will always defy classification at least a little.

  • pyrale 5 hours ago

    I would argue that this opinion has developed because the footprint of the average service has shrunk over the years.

    If you're testing large, complex services that involve many different behaviours, you're still going to have a test pyramid. If you've implemented microservices, what you used to call an integration test has now become an e2e test in your new architecture. And you still don't want to have mostly that.

  • jillesvangurp 3 hours ago

    I apply the following reasoning model to testing:

    - Integration tests are expensive to run and take time to write; therefore it is important to maximize their value. The ultimate integration test is an end-to-end test, because it maximizes the scope of what is under test and the potential for weird feature interactions to trigger exactly the kind of failures you want to find with such a test.

    - Unit tests are orders of magnitude cheaper to run; so have lots of them but make sure they are easy to maintain and simple so they minimize time spent on them.

    - Anything in between is a compromise between shooting for realism vs. execution speed. Still expensive to run and maintain but it just does not deliver a lot of value.

    - Test coverage becomes exponentially harder with the size of the unit you are testing. Test coverage for integration tests is a meaningless notion. With end-to-end integration tests you shoot for realism, not coverage. They should cover the things users of your system actually do, in the ways they actually do them.

    - Mocking and faking are needed to unit test code that violates the SOLID principles and is otherwise hard to test. So such tests have the development overhead of an integration test but deliver only the value of a unit test. This is not ideal. It's better to unit test code that is very testable and cover the rest with integration tests that deliver more valuable insight. Lots of very complex unit tests are hard to develop and limited in value.

    I just removed the one remaining test that used mockk in my Kotlin code base. I have hundreds of API integration tests. And lots of simple unit tests. I focus my unit tests on algorithms, parameters, and those sort of things. My integration tests ensure the system does what it needs to.

    I run integration tests concurrently so they complete quickly. This increases their value because it proves the code still works if there's more than one user in the system.
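
    A sketch of that last point (an illustrative harness, not the commenter's actual Kotlin setup): independent test callables run concurrently on a thread pool, which both shortens the suite and smokes out code that breaks with more than one user in the system:

```python
from concurrent.futures import ThreadPoolExecutor

def run_concurrently(tests: list) -> dict[str, bool]:
    """Run each zero-arg test callable; map its name to pass/fail."""
    def run_one(test) -> tuple[str, bool]:
        try:
            test()
            return test.__name__, True
        except AssertionError:
            return test.__name__, False

    # Threads suffice here because integration tests are IO-bound.
    with ThreadPoolExecutor(max_workers=8) as pool:
        return dict(pool.map(run_one, tests))
```
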

  • TypingOutBugs 4 hours ago

    This is just an ad, but in principle I agree integration tests can usually bring more value (depending on the system, but in my general experience across many companies of all sizes).

  • terpimost 5 hours ago

    Another opinion on changing the testing pyramid https://antithesis.com/blog/testing_pyramid/

    • The_Colonel 5 hours ago

      It's more of an advertisement than an opinion.

      They have a specific definition of "E2E" (apparently the UI is not considered part of it) and it works on the Docker platform only (so not for e.g. Windows binaries). It can be good, but it does not speak to the testing pyramid in general.

  • sagacity 5 hours ago

    I agree with the sentiment of the article: I've seen too many codebases where a lot of stuff is being tested via unit tests (sometimes even via heavy mocks). The main implication is that your code essentially becomes impossible to refactor, since you're basically testing how your internals are wired up and any change to your code structure immediately causes a big chunk of tests to fail even though on the outside everything still works.

    Having said that, I think this mostly means that people find the term "unit" in "unit tests" ambiguous and they're just cargo culting it to mean "a single class" or whatever. That's the fundamental flaw that should be addressed. Basically that's what the article is saying as well, I guess, by implying that the API is the contract you should be testing, etc. But that is essentially just a long winded way of saying "The API is the unit".

  • drewcoo 4 hours ago

    Advertisement.

  • jonstewart 4 hours ago

    From TFA:

      Since then, significant progress in both technology and development practices has transformed testing in three key ways:
    
      1) It’s now possible to run a wide range of tests on an application very quickly through its public interface, enabling a broader scope of testing without excessive time or resource constraints.
      
      2) Improved test frameworks have brought down the cost and effort required to write robust, maintainable integration tests, offering accessible, scalable ways of validating the interplay between components.
      
      3) The development of sophisticated debugging tools and enhanced observability platforms has made it much easier to identify the root causes of failures. This improvement reduces the reliance on narrowly focused unit tests.
    
    
    These assertions are simply made, not argued or justified. Maybe they apply to the code they're writing? I don't think they apply to my code.

  • dickiedyce 5 hours ago

    Actually, according to Anne Elk, it's a test "dinosaur"...

    https://www.youtube.com/watch?v=k-t4OiEHCiA (at 5:25)

    ... which seems appropriate.

  • cies 5 hours ago

    In the place we currently work, we follow the principle that "tests have to be paid for". So either devs believe they can build/fix a piece of code faster by adding tests (usually unit tests), in which case the tests are part of another issue; or the business mandates the tests (usually integration or e2e), which is then put on the board as its own issue.

    This is very specific to the business we are in: "move fast and occasionally break things" is acceptable.

    OTOH we have focused a lot on adding type safety to avoid many basic mistakes: exceptions to result-types, no more implicit nulls, replace JS with Elm, replace Java with Kotlin, replace SQL-in-strings with jOOQ, and a culture of trying to write code that does not allow bad states to be expressed. This has taken precedence over writing extensive test suites.
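
    As a sketch of the exceptions-to-result-types move (a hypothetical Python version; the thread's codebase is Kotlin/Elm): failure becomes an ordinary value the caller must handle, so bad states can't slip through as forgotten exceptions:

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Ok:
    value: int

@dataclass
class Err:
    reason: str

Result = Union[Ok, Err]

def parse_age(raw: str) -> Result:
    """Return Ok(age) or Err(reason) instead of raising."""
    if not raw.strip().isdigit():
        return Err(f"not a number: {raw!r}")
    age = int(raw)
    if age > 150:
        return Err(f"implausible age: {age}")
    return Ok(age)
```

    The caller has to branch on Ok vs Err to get at the value, which is exactly the "bad states cannot be expressed" culture described above.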