Smurf: Beyond the Test Pyramid

(testing.googleblog.com)

47 points | by BerislavLopac 7 hours ago

29 comments

  • DanHulton 15 minutes ago

    I _really_ have to dispute the idea that unit tests score the maximum on maintainability. The fact that they are _so_ tightly tied to lower-level code makes your code _miserable_ to maintain. Anyone who's ever had to work on a system that had copious unit tests deep within will know the pain of not just changing code to fix a bug, but having to change a half-dozen tests because your function interfaces have now changed and a healthy selection of your tests refuse to run anymore.

    The "test diamond" has been what I've been working with for a long while now, and I find I greatly prefer it. A few E2E tests to ensure critical system functionality works, a whole whack of integration tests at the boundaries of your services/modules (which should have well-defined interfaces that are unlikely to change frequently when making fixes), and a handful of unit tests for things that are Very Important or just difficult or really slow to test at the integration level.

    This helps keep your test suite size from running away on you (unit tests may be fast, but if you work somewhere that has a fetish for them, it can still take forever to run a few thousand), ensures you have good coverage, and helps reinforce good practices around planning and documentation of your system/module interfaces and boundaries.

  • imiric 5 hours ago

    This is interesting, but I see a few issues with it:

    - Maintainability is difficult to quantify, and often subjective. It's also easy to fall into a trap of overoptimizing or DRYing test code in the pursuit of improving maintainability, and actually end up doing the opposite. Striking a balance is important in this case, which takes many years of experience to get a feel for.

    - I interpret the chart to mean that unit tests have high maintainability, i.e. that it's a good thing, when that is often not the case. If anything, unit tests are the most brittle and the most susceptible to low-level changes. That's good, since they're your first safety net, but it also means you spend a lot of time changing them. And since you should have many unit tests, a lot of maintenance work goes into them.

    I see the reverse for E2E tests as well. They're easier to maintain, since typically the high-level interfaces don't change as often, and you have fewer of them.

    But most importantly, I don't see how these definitions help me write better tests, or choose what to focus on. We all know that using fewer resources is better, but that will depend on what you're testing. Nobody likes flaky tests, but telling me that unit tests are more reliable than integration tests won't help me write better code.

    What I would like to see instead are concrete suggestions on how to improve each of these categories, regardless of the test type. For example, not relying on time or sleeping in tests is always good to minimize flakiness. Similarly for relying on system resources like the disk or network; that should be done almost exclusively by E2E and integration tests, and avoided (mocked) in unit tests. There should also be more discussion about what it takes to make code testable to begin with. TDD helps with this, but you don't need to practice it to the letter if you keep some design principles in mind while you're writing code that will make it easier to test later.
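
    For instance, a rough sketch of the time point (hypothetical Cache class, pytest-style; illustrative, not prescriptive):

        import pytest

        class Cache:
            """Hypothetical cache with a TTL; the clock is injectable."""

            def __init__(self, ttl_seconds, clock):
                self.ttl = ttl_seconds
                self.clock = clock
                self.entries = {}

            def put(self, key, value):
                self.entries[key] = (value, self.clock())

            def get(self, key):
                value, stored_at = self.entries[key]
                if self.clock() - stored_at > self.ttl:
                    raise KeyError(key)
                return value

        def test_entry_expires():
            # A fake clock the test can advance instantly: no time.sleep(),
            # so there is nothing here that can be flaky.
            now = [0.0]
            cache = Cache(ttl_seconds=60, clock=lambda: now[0])
            cache.put("k", "v")
            now[0] += 61
            with pytest.raises(KeyError):
                cache.get("k")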

    I've seen many attempts at displacing the traditional test pyramid over the years, but so far it's been the most effective guiding tool in all projects I've worked on. The struggle that most projects experience with tests stems primarily from not following its basic principles.

    • js8 4 hours ago

      > There should also be more discussion about what it takes to make code testable to begin with.

      IME, testable pretty much just means referentially transparent.
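
      E.g., a trivial sketch of the difference (made-up functions):

          import datetime

          # Referentially transparent: the output depends only on the input,
          # so the test is a plain assertion with no setup.
          def is_weekend(day: datetime.date) -> bool:
              return day.weekday() >= 5

          def test_is_weekend():
              assert is_weekend(datetime.date(2024, 6, 1))      # a Saturday
              assert not is_weekend(datetime.date(2024, 6, 3))  # a Monday

          # Not referentially transparent: it reads the ambient clock, so a
          # deterministic test would have to patch or freeze time first.
          def is_weekend_today() -> bool:
              return datetime.date.today().weekday() >= 5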

    • stoperaticless 5 hours ago

      > I don't see how these definitions help me write better tests

      > What I would like to see instead …

      If you hire me, I can address those needs.

    • joshuamorton 4 hours ago

      > If anything, unit tests are the most brittle and susceptible to low-level changes.

      I don't find this to be the case if the unit tests are precise (which they should be).

      That is, if you are writing non-flaky unit tests which do all the "right" unit-testy things (using fakes and dependency injection well, and so isolating and testing only the unit under test), you should end up with a set of tests that

      - Fails only when you change the file/component the test relates to

      - Isn't flaky (can be run ~10000 times without failing)

      - Is quick (you can run the 10000-run loop above near-interactively, in a few minutes, by running tests in parallel on a beefy workstation)

      This compares to integration/e2e tests, which inherently break due to other systems and unrelated assumptions changing (sometimes legitimately, sometimes not), and can have flakiness rates of 1-10% due to the inherent nature of "real" systems failing to start occasionally, plus the inherently longer test-debug cycle that makes fixing issues more painful (root-causing a bug that makes a test fail 1% of the time is much easier when the test takes 0.3 CPU-seconds than when it takes 30 or 300 CPU-seconds).

      Very few tests I see are actually unit tests in the above sense; many people only write integration tests because the code under test is structured in ways that are difficult or impossible to test in isolation.
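
      A minimal sketch of the shape I mean (hypothetical PriceService with an injected rate source):

          class FixedRates:
              """Hand-rolled fake standing in for a real rate-fetching client."""

              def __init__(self, rates):
                  self.rates = rates

              def rate(self, currency):
                  return self.rates[currency]

          class PriceService:
              # The collaborator is injected, so tests never touch the network.
              def __init__(self, rate_source):
                  self.rate_source = rate_source

              def in_currency(self, amount_usd, currency):
                  return amount_usd * self.rate_source.rate(currency)

          def test_converts_using_rate_source():
              service = PriceService(FixedRates({"EUR": 0.5}))
              assert service.in_currency(10.0, "EUR") == 5.0

      A test like that fails only when PriceService itself changes, and there is nothing in it that can be flaky.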

  • shepherdjerred 2 hours ago

    I have found testcontainers to be an excellent way to write integration/end-to-end tests as easily as unit tests.

    It takes care of the chore of setting up test environments, though it won’t solve all of your problems.

    I took this approach when testing an application at my last workplace. It made writing tests significantly easier, and, IMO, fun.

    https://testcontainers.com/
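
    For a taste, a minimal sketch using the Python flavour of the library (assumes Docker is available, plus testcontainers-python and SQLAlchemy; adapt to your stack):

        import sqlalchemy
        from testcontainers.postgres import PostgresContainer

        def test_select_against_real_postgres():
            # Spins up a throwaway Postgres container, torn down on exit.
            with PostgresContainer("postgres:16-alpine") as pg:
                engine = sqlalchemy.create_engine(pg.get_connection_url())
                with engine.connect() as conn:
                    assert conn.execute(sqlalchemy.text("select 1")).scalar() == 1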

  • sverhagen 5 hours ago

    This model ("mnemonic") feels like a good tool to reason about your testing strategy. I ran across the "testing trophy" in the past, which really changed my thinking already, having been indoctrinated with the testing pyramid for such a long time before that. I wanted to share my favorite links about the testing trophy, for those interested:

    https://tanzu.vmware.com/content/videos/tanzu-tv-springoneto...

    https://kentcdodds.com/blog/the-testing-trophy-and-testing-c...

  • usbsea 6 hours ago

    This is obvious, as another commenter said, but it is nonetheless useful.

    You can use it to show graduates. Why have them waste time repeating the same mistakes? You probably need a longer blog post with examples.

    It is useful as a checklist, so you can pause while working earlier in the lifecycle to consider these things.

    I think there is power in spelling out the obvious. Sometimes experienced people miss it!

    The diagram can be condensed by saying SMUR + F = 1. In other words, you can slide towards Fidelity, or towards "nice testability", which covers the SMUR properties.

    However, it is more complex than that!

    Let's say you have a unit test for a parser within your code. For a parser, a unit test might have pretty much the same fidelity as an integration test (running the parser directly from a unit test, rather than, say, doing a full compilation through something like Replit online), while all the other properties stay those of a unit test.
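
    Concretely, a sketch using Python's own ast module as a stand-in parser:

        import ast

        def test_parser_precedence():
            # Driving a real parser directly from a unit test: the fidelity is
            # essentially that of an integration test, but the test stays fast,
            # isolated and deterministic.
            tree = ast.parse("1 + 2 * 3", mode="eval")
            assert isinstance(tree.body, ast.BinOp)
            assert isinstance(tree.body.op, ast.Add)
            assert isinstance(tree.body.right, ast.BinOp)  # 2 * 3 binds tighter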

    Another point: you are not testing anything if you have zero e2e tests. You get a lot (a 99-1 split, not 80-20) by having some e2e tests, and then the other types of tests soon make sense as well. In addition, e2e tests, if well written and well considered, can also be run in production as synthetics.

    • candiddevmike 6 hours ago

      What's useful here? There's nothing actionable, no way to quantify if you're doing "SMURF" correctly. All the article describes is semi-obvious desirable qualities of a test suite.

      • viraptor 5 hours ago

        You're not "doing SMURF". It's not an approach or a system. It's just a specific vocabulary to talk about testing approaches better. They almost spell it out: "The SMURF mnemonic is an easy way to remember the tradeoffs to consider when balancing your test suite".

        It's up to your team (and really always has been) to decide what works best for that project. You get to talk about tradeoffs and what's worth doing.

      • stoperaticless 5 hours ago

        > What's useful here?

        It is up to the reader to figure out this one.

  • nemetroid 5 hours ago

    End-to-end tests verify high-level expectations based on the specification of the system. These high-level expectations generally stay stable over time (at least compared to the implementation details verified by lower-level tests), and therefore end-to-end tests should have the best maintainability score.

    • bastawhiz 3 hours ago

      > end-to-end tests should have the best maintainability score.

      End to end tests encompass far more total details than implementation or unit tests. If you're testing a website, moving a button breaks a test. Making the button have a confirmation breaks the test. The database being slower breaks the tests. The number of items in a paginated list changing breaks the tests. You're testing not just the behavior and output of interfaces, you're testing how they're composed. The marketing team putting a banner in the wrong place breaks the tests. The product team putting a new user tour popover on the wrong button breaks the tests. The support team enabling the knowledge base integration in your on-page support widget breaks the tests.

      Moreover, the cost of fixing the tests is also often higher, because end to end tests are necessarily slower and more complicated. Tests often become flaky because of a larger number of dependencies on external systems. It's often not clear why a test is failing, because the test can't possibly explain why its assertion is no longer true ("The button isn't there" vs "The button now has a slightly different label").

    • wpietri 4 hours ago

      The expectations can be pretty stable, but because they cover so much of the system, they tend to be more fragile. End to end tests are also often flakier because they're dealing with the system at levels meant for human interaction, like by driving actual browsers. Because they encompass so much, they're also the slowest. When you have a problem with an end-to-end test, they can take way more time to debug because of that.

      So I'd agree with them; E2E tests are the hardest to maintain.

  • jbjbjbjb 5 hours ago

    Not to be pedantic, but practically speaking it looks like there are two dimensions: fidelity and then the rest (the SMUR).

  • eiathom 5 hours ago

    It always amazes me how speed and testing are placed in the same bracket. I want solid verification, and a strong pattern for repeating that verification no matter what. This then allows for fast implementation. So something that reliably integrates as many components as possible makes the most sense (verification-wise): I want to verify value early. It is eyebrow-raising that this pyramid nonsense has hung around.

    • stoperaticless 4 hours ago

      > I want solid verification, and a strong pattern to repeat verification no matter what.

      Everybody has a time budget for test execution. I doubt you would wait a year for a test suite to finish.

      It might be feasible to exhaustively test all inputs of a tiny function like “int x + int y”, but it is certainly not feasible to test all inputs of a typical GUI application.

      Trade-offs have to be recognised; some balance must be struck.

      Speed is one of the factors that has to be considered.

      • shepherdjerred 2 hours ago

        You don’t have to consider every possibility, but you can randomly search the tree of possible states.
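
        E.g., a sketch with a property-based tool like Hypothesis, which samples the input space instead of enumerating it:

            from hypothesis import given, strategies as st

            def add(x: int, y: int) -> int:
                return x + y

            @given(st.integers(), st.integers())
            def test_add_is_commutative(x, y):
                # Hypothesis draws random inputs and shrinks any failure
                # it finds down to a minimal counterexample.
                assert add(x, y) == add(y, x)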

    • the_af 4 hours ago

      When the test suite is too slow, it becomes unwieldy, it gets in the way (instead of being part of a valuable feedback loop) and everyone starts looking for shortcuts to avoid running it.

      • wpietri 4 hours ago

        For sure. And I'd add that the specificity of unit tests is hugely valuable. If I do some sort of refactoring and find I've broken an end-to-end test, it may tell me very little about where the problem is. But the pattern of unit test breakage is very informative. Often just seeing the list of broken tests is enough to tell me what the issue is, without even having to look at an error message.

  • pydry 6 hours ago

    They should at least admit they made a mistake with the "test pyramid".

    • codeflo 6 hours ago

      Not the least of the problems is that people who take stuff like that as gospel, instead of as a heuristic, tend to endlessly argue about what is and isn't a "unit".

    • petesergeant 5 hours ago

      I feel like "testing trophy" has been in vogue for a while now, and definitely feels more right as someone who's made a career of unfucking test suites, but there's almost no area of software engineering as involved in navel-gazing as testing.

    • candiddevmike 6 hours ago

      Conjoined triangles of testing

  • grahamj 6 hours ago

    This is all pretty obvious

    • sverhagen 5 hours ago

      And yet, people focus endlessly on unit tests, pyramid in hand, saying things like: they're the best kind of test, or at least better than integration tests. It takes some maturity to articulate the nuances, and while I've tried, I think SMURF may be a good aid in that. I moved away from the religious focus on unit tests long ago, but I appreciate learning about SMURF today.

      • grahamj 4 hours ago

        That's fair. It's easy to get philosophical about such things, so something you can point to that's more metrics-based can help a discussion be more objective.

        otoh countering opinion with fact doesn't always work well - it might just turn into quibbling over where on each axis different test types' strengths lie ;)

        • sverhagen 3 hours ago

          I think a lot of folks have never heard anything but the testing pyramid, repeated over and over. I find them often very open to other ideas; in my case I'd previously heard about the "testing trophy", and found willing audiences.