FBDetect: Catching Tiny Performance Regressions at Hyperscale [pdf]

(tangchq74.github.io)

35 points | by pjmlp 3 days ago

7 comments

  • dataflow 3 hours ago

    Detecting a 0.005% regression means detecting that a 20s task now takes 20.001s.

    It's not even easy to reliably detect such a small performance regression on a single thread on a single machine.

    I suppose in theory having multiple machines could actually improve the situation, by letting them average out the noise? But on the other hand, it's not like you have identically distributed samples to work with - workloads have variance over time and space, so there's extra noise that isn't uniform across machines.
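
    As a rough sketch of that averaging intuition (toy numbers, and assuming i.i.d. noise, which real traffic isn't):

        # Toy sketch: how many i.i.d. samples before a 0.005% mean shift
        # stands out from per-request noise? All numbers here are made up.
        mean = 20.0            # seconds per task (from the example above)
        shift = mean * 5e-5    # a 0.005% regression -> 1 ms
        sigma = 2.0            # assumed per-sample noise, 10% of the mean

        # Roughly detectable once the shift exceeds ~2 standard errors,
        # i.e. once 2 * sigma / sqrt(n) < shift.
        n = (2 * sigma / shift) ** 2
        print(f"shift = {shift * 1e3:.1f} ms, samples needed ~ {n:,.0f}")
        # ~16 million samples with these numbers: plausible across a fleet,
        # but real workload noise is neither i.i.d. nor stationary.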

    Color me a little skeptical, but it's super cool if actually true.

    • yuliyp 2 hours ago

      The .005% is a bit of a headline-grabber for sure, but the idea makes sense given the context: they're monitoring large, multi-tenant, multi-use-case systems with very large amounts of diverse traffic. In these cases the regression may be .005% of the overall cost, but you don't detect it at that level; you detect it as a 0.5% regression in a use case that accounts for 1% of the cost. They can and do slice the data in various ways (group by endpoint, group by function, etc.) to improve the odds of detecting a regression.
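
      A toy sketch of that slicing arithmetic (use-case names and numbers made up): 0.5% of a slice that is 1% of the cost is 0.5% × 1% = 0.005% overall, but it stands out within its own slice:

          # Toy sketch of slicing: a regression that's tiny overall is much
          # larger relative to the one use case it hits. Numbers are made up.
          baseline = {"feed": 70.0, "search": 29.0, "settings": 1.0}    # % of total cost
          current  = {"feed": 70.0, "search": 29.0, "settings": 1.005}  # settings +0.5%

          overall = (sum(current.values()) - sum(baseline.values())) / sum(baseline.values())
          print(f"overall change: {overall:.4%}")   # 0.0050%

          for use_case in baseline:
              rel = (current[use_case] - baseline[use_case]) / baseline[use_case]
              print(f"{use_case:>8}: {rel:+.2%}")   # settings shows +0.50%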

    • vlovich123 an hour ago

      You're right to be skeptical. This entire space, as far as I can tell, is filled with people who overpromise, under-deliver, and use bad metrics to claim success. If you look at their false-positive and false-negative sections, they perform terribly, but the wording claims it's actually good, and flawed logic is used to extrapolate over missing data (e.g. assuming their rates stay the same for non-responses, rather than "people are tired of our tickets and ignore our system"). And the proposed follow-up work is to keep tuning their parameters (i.e. keep fiddling to overfit past data). You can even tell how the system is perceived: in 2 of the 4 high-impact incidents examined, people didn't even bother to interact with it, and for one of them the developer is blamed because "they didn't integrate the metrics". If a system provided a meaningful cost/benefit, teams would be clamoring to adjust their processes. Until demonstrated clearly otherwise, it's dressed-up numerology.

      I saw a team at Oculus try to do this in highly constrained, isolated environments with repeatable workloads and a much more conservative threshold (e.g. 1-10%), and they still failed. This paper is promulgating filtering your data all to hell, to the point of overfitting.

      • dataflow an hour ago

        Thanks, I hadn't read that far into the paper. But honestly, I had what I feel is a good reason to be skeptical before reading a single word of it: Facebook never felt so... blazing fast, shall we say, as to make me believe anyone even wanted to pay attention to tiny performance regressions, let alone had the drive and tooling to do so.

        • vlovich123 an hour ago

          > blazing fast, shall we say, to make me believe anyone even wanted to pay attention to tiny performance regressions

          Important to distinguish frontend vs backend performance, of course. This is about backend performance, where they care about this stuff a lot because it multiplies at scale and starts costing them real money. Frontend performance has less of a direct impact on their numbers; the only data I know of there is the oft-cited Google work claiming a direct correlation between latency and lost revenue (which I haven't seen anyone else bother to try to replicate to see if it holds up).

          • dataflow 39 minutes ago

            > Important to distinguish frontend vs backend performance of course.

            I'm not sure what you're calling frontend in this context (client side? or client-facing "front-end" servers?), but I'm talking about their server side specifically. I've looked at their API calls and can tell when slowness is coming from the client vs. the server. The client side is even slower, for sure, but the server side also never felt like it was optimized to the point where deviations this small mattered.

    • protomolecule 17 minutes ago

      "...measuring CPU usage at the subroutine level rather than at the overall service level. However, if this 0.005% regression originates from a single subroutine that consumes 0.1% of the total CPU, the relative change at the subroutine level is 0.005% / 0.1% = 5%, which is much more substantial. Consequently, small regressions are easier to detect at the subroutine level."