Limitations of frame pointer unwinding

(developers.redhat.com)

117 points | by rwmj 17 hours ago

82 comments

  • elteto 13 hours ago

    Didn’t really get the point of the post as it just presents something without a conclusion.

    9X% of users do not care about a <1% drop in performance. I suspect we get the same variability just by going from one kernel version to another. The impact from all the Intel mitigations that are now enabled by default is much worse.

    However I do care about nice profiles and stack traces without having to jump through hoops.

    Asking people to recompile an _entire_ distribution just to get sane defaults is wrong. Those who care about the last drop should build their custom systems as they see fit, and they probably already do.

    • j16sdiz 2 hours ago

      > 9X% of users do not care about a <1%

      The same could be said about any accessibility issue or minority language translations

      • jchw 6 minutes ago

        This comparison is pretty misleading. An accessibility issue prevents someone from being able to use software effectively. Not having localized text would have a similar impact. A ~1% performance impact on the other hand is the minuscule downside of improving debugging, profiling and error reporting for an entire OS. And that's not just a minority of users, as tons of software will automatically gather stack traces for bug reports.

        There's basically no downside to fixing accessibility issues or adding new language translations other than the work involved in doing so. (And yes, maintaining translations over time is hard, but most projects let them lag during development, so they don't directly hold anything back.) There is a rather glaring downside to this performance optimization, whose upside is sometimes entirely within run-to-run variance and can be blown away by almost any other performance tweak. It's clear the optimization has some upsides, but an extra register and saving some trivial loads/stores just isn't as big of a deal on modern processors that are loaded to the gills with huge caches and deep pipelines.

        I guess I don't care that much about -fomit-frame-pointer in the grand scheme of things, but I think enabling it in distributions was ultimately a mistake. If some software packages benefited enough from it, it could've just been done only for those packages. Doing it across the system is questionable at best...

    • yosefk 12 hours ago

      it does present a conclusion. once the kernel supports .sframe it will be all-around superior to -fomit-frame-pointer, and a better default for distros to use.

      • audidude 12 hours ago

        It does cause more memory pressure because the kernel will have to look at the user-space memory for decoding registers.

        So yes, it will be faster than the alternatives to frame pointers, but it still won't be as fast as frame pointers.

    • Brian_K_White 12 hours ago

      But does what you care about matter enough to be the default?

      Are you the majority?

      Evaluate "majority" this way: For every/any random binary in a distro, out of all the currently running instances of that binary in the world at any given moment, how many of those need to be profiled?

      There is no way the answer is "most of them".

      You have a job where you profile things, and maybe even you profile almost everything you touch. Your whole world has a high quotient of profiling in it. So you want the whole system built for profiling by default. How convenient for you. But your whole world is not the whole world.

      But it's not just you, there are, zomg thousands, tens of thousands, maybe even hundreds of thousands of developers and ops admins the same as you.

      Yes and? Is even that most installed instances of any given executable? No way.

      Or maybe yes. It's possible. Can you show that somehow? But I will guess no way and not even close.

      • dap 9 hours ago

        > Evaluate "majority" this way: For every/any random binary in a distro, out of all the currently running instances of that binary in the world at any given moment, how many of those need to be profiled?

        > There is no way the answer is "most of them".

        This is an absurd way to evaluate it. All it takes is one savvy user to report a performance problem that developers are able to root-cause using stack traces from the user's system. Suppose they're able to make a 5% performance improvement to the program. Now all users' programs are 5% faster because of the frame pointers on this one user's system.

        At this point people usually ask: but couldn't developers have done that on their own systems with debug code? But the performance of debug code is not the same as the performance of shipping code. And not all problems manifest the same on all systems. This is why you need shipping code to be debuggable (or instrumentable or profileable or whatever you want to call it).

      • audidude 12 hours ago

        I regularly have users run Sysprof and upload the results to issues. It's immensely powerful to be able to see what is going on on systems that are having issues. I'd argue it's one of the major reasons GNOME performance has gotten so much better in the recent past.

        You can't do that when step one is reinstall another distro and reproduce your problem.

        Additionally, performance-sensitive code where a 1% overhead could matter (hint: there's not much of it) rarely uses the system libraries in a way that would cause this overhead anyway. Such an app can be compiled with frame pointers disabled. And for stuff where they do use system libraries (qsort, bsearch, strlen, etc.), the frame pointer cost is negligible next to the work being performed. Your margin of error is way larger than the theoretical overhead.

        • Brian_K_White 11 hours ago

          1% is a ton. 1% is crazy. Visa owns the world off just a 3% tax on everything else. Brokers make billions off of just 1% or even far less.

          1% of all activity is only rational if you get more than 1% of all activity back out from those times and places where it was used.

          1%, when it's of everything, is an absolutely stupendous, colossal number that is absolutely crazy to try to treat as trivial.

          • ploxiln 11 hours ago

            Better analogy: you're paying 30% to apple, and over 50% in bad payday loans, and you're worried about the 3% visa/stripe overhead ... that's kinda crazy. But that's where we are in computer performance, there's 10x, 100x, and even greater inefficiencies everywhere, 1% for better backtraces is nothing.

            • audidude 10 hours ago

              Absolutely. We've gotten numerous double digit performance improvements across applications, libraries, and system daemons because of frame-pointers in Fedora (and that's just from me).

      • wbl 11 hours ago

        Performance problems matter to the people who have them, who often are in an inconvenient place. Having the ability for profiling to just work means that it's easy to help these people.

      • elteto 11 hours ago

        I think you are trying to make this out to be something that it isn’t.

        Visibility at the “cost” of negligible impact is more important than raw performance. That’s it.

        I’m a regular user of Linux with some performance sensitivity that does not go as far as “I _need_ that extra register!”. That’s what the majority of developers working on Linux are like. I think it’s up to _you_ to prove the contrary.

      • PittleyDunkin 10 hours ago

        This seems like a ridiculous attempt to bury your head in the sand. Is there any evidence anyone doesn't want frame pointers?

        • Brian_K_White 10 hours ago

          I think it's ridiculous to question that since obviously, yes, many people have decided exactly that. I see no point myself and I'm even in the field. And I am not in charge of all the distributions which disabled it by default.

          So, "yes". In fact "yes, duh?" Talk about head in sand...

          • PittleyDunkin 10 hours ago

            Ok, where's the evidence?

            > I see no point myself and I'm even in the field.

            You don't see the point of readable stack traces?

            • Brian_K_White 9 hours ago

              Nope. Not on 99.999% of installed binaries in existence and running at a given moment.

              • PittleyDunkin 8 hours ago

                That strikes me as an insane take (not to mention blatantly inaccurate), but I take your point that this is a common one for distribution-maintainers to have.

    • josefx 9 hours ago

      > 9X% of users do not care about a <1% drop in performance.

      Except Python got opted out of the frame pointer change due to benchmarks showing slowdowns of up to 10%. The discussion around that had the great idea of just adding a pragma to flat out override the build setting. So in the end that "1%" reduction claim only holds if everything even remotely affected silently ignores the flag.

      • audidude 7 hours ago

        This is a bit of a mischaracterization of the Python side of things.

        They only opted out for 3.11 which did not yet have the perf-integration fixes anyway. 3.12 uses frame-pointers just fine.

        • josefx 5 hours ago

          Any link to the fix or documentation about it? I could find added perf support but did not see anything about improved performance related to frame pointer use.

    • baq 12 hours ago

      fifty independent 1% performance drops nobody cares about compound to a ~40% reduction.
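
      A quick check of that arithmetic, as a minimal sketch. It assumes every regression is exactly 1% and that the slowdowns multiply independently, which real workloads will only approximate; the function name is made up for illustration.

```python
def remaining_performance(drop: float, count: int) -> float:
    """Fraction of the original throughput left after `count` independent drops."""
    return (1.0 - drop) ** count


# Fifty regressions of 1% each, compounded multiplicatively.
remaining = remaining_performance(0.01, 50)
reduction = 1.0 - remaining
print(f"remaining: {remaining:.3f}, reduction: {reduction:.1%}")
```

      Under those assumptions about 60.5% of the original performance is left, so the "~40%" figure holds up.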

  • audidude 14 hours ago

    I added support to Sysprof this weekend for unwinding using libdwfl and DWARF/CFI/eh_frame/etc techniques that Serhei did in eu-stacktrace.

    The overhead is about 10% of samples. But at least you can unwind on systems without frame-pointers. Personally I'll take the statistical anomalies of frame-pointers which still allow you to know what PID/TID are your cost center even if you don't get perfect unwinds. Everyone seems motivated towards SFrame going forward, which is good.

    https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-fr...

  • ot 12 hours ago

    I broadly agree with the thesis of the post, which if I understand correctly is that frame pointers are a temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers, and it seems that we'll eventually get there.

    Some of the statements in the post seem odd to me though.

    - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.

    - Is using the whole 8 bytes right for the estimate? Pushing the stack pointer is the first instruction in the prologue and it's literally 1 byte. Epilogue is symmetrical.

    - Even if we're in the prologue, we know that we're in a leaf call, we can still resolve the instruction pointer to the function, and we can read the return address to find the parent, so what information is lost?

    When it comes to future alternatives, while frame pointers have their own problems, I think that there are still a few open questions:

    - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?

    - Is the memory overhead of lookup tables for very large programs acceptable?

    • rwmj 12 hours ago

      > - Is using the whole 8 bytes right for the estimate? Pushing the stack pointer is the first instruction in the prologue and it's literally 1 byte. Epilogue is symmetrical.

      I believe it's because of the landing pad for Control Flow Integrity which basically all functions now need. Grabbing main() from a random program on Fedora (which uses frame pointers):

          0000000000007000 <main>:
          7000:       f3 0f 1e fa       endbr64     ; landing pad
          7004:       55                push   %rbp ; set up frame pointer
          7005:       48 89 e5          mov    %rsp,%rbp
      
      It's not much of an issue in practice as the stack trace will still be nearly correct, enough for you to identify the problematic area of the code.

      > - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?

      Yes shadow stacks are limited to 32 entries on the most recent Intel CPUs (and as little as 4 entries on very old ones). However they are basically cost free so that's a big advantage.

      I think SFrame is a sensible middle ground here. It's saner than DWARF and has a long history of use in the kernel so we know it will work.

      • Sesse__ 11 hours ago

        If you're limited to 32 entries, why not just use LBR, then? It has basically the same pros and cons.

    • Sesse__ 11 hours ago

      > - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.

      TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI, so every time I profile stuff I get depressed… Aarch64 seems to be better, the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)

      • manwe150 10 hours ago

        This sounds to me more like an inlining problem than an ABI problem. If the calls take as much time as the running, perhaps you just need a better language that doesn’t arbitrarily prevent inlining due to compilation boundaries (e.g. basically any modern language that isn’t in the C/C++ family, before LTO)

        • Sesse__ 9 hours ago

          I see this in LTO/PGO binaries as well. If a function is 20 instructions long, it's not like you can inline it uncritically, yet a five-cycle prologue and a five-cycle epilogue will hurt. (Also, recursive functions etc.)

    • audidude 12 hours ago

      > Shadow stacks are cool but aren't they limited to a fixed number of entries?

      On currently available hardware, yes. But I think some of the future Intel stuff was going to allow for much larger depth.

      > Is the memory overhead of lookup tables for very large programs acceptable?

      I don't think SFrame is as "dense" as DWARF as a format so you trade a bit of memory size for a much faster unwind experience. But you are definitely right that this adds memory pressure that could otherwise be ignored.

      Especially if the anomalies are what they sound like, just account for them statistically. You get a PID for cost accounting in the perf_event frame anyway.

    • quotemstr 11 hours ago

      > temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers,

      Temporary solutions have a way of becoming permanent. I was against the recent frame pointer enablement on the grounds of moral hazard. I still think it would have been better to force the ecosystem to get its act together first.

      Another factor nobody is talking about is JITed and interpreted languages. Whatever the long-term solution might be, it should enable stack traces that interleave accurate source-level frame information from native and managed code. The existing perf /tmp file hack is inadequate in many ways, including security, performance, and compatibility with multiple language runtimes coexisting in a single process.

      • audidude 9 hours ago

        It's a disaster no doubt.

        But, at least from the GNOME side of things, we've been complaining about it for roughly 15 years and kept getting push-back in the form of "we'll make something better".

        Now that we have frame-pointers enabled in Fedora, Ubuntu, Arch, etc we're starting to see movement on realistic alternatives. So in many ways, I think the moral hazard was waiting until 2023 to enable them.

  • Brian_K_White 14 hours ago

    Is this a response to Alma Kitten?

    In any event I don't understand why frame pointers need to be in by default instead of developers enabling where needed.

    Having Kitten include pointers by default seems reasonable enough, since Kitten is a devel system.

    • adrian_b 12 hours ago

      In reality all this discussion has its origin in a design mistake made by Intel already in the 8086 CPU, when it was launched in 1978.

      They have designed the instruction set in such a way that two distinct registers were necessary for fulfilling the roles of the stack pointer and of the frame pointer.

      In better designed instruction sets, for example in IBM POWER, a single register is enough for fulfilling both roles, simultaneously being both stack pointer and frame pointer.

      Unfortunately, the Intel designers have not thought at all about this problem, but in 1978 they have just followed the example of the architectures popular at that time, e.g. DEC VAX, which had also made the same mistake of reserving two distinct registers for the roles of stack pointer and of frame pointer.

      In the architectures where a single register plays both roles, the stack pointer always points to a valid stack frame that is a part of a linked list of all stack frames. For this to work, there must be an atomic instruction for both creating a new stack frame (which consists in storing the old frame pointer in the right place of the new stack frame) and updating the stack pointer to point to the new stack frame. The Intel/AMD ISA does not have such an atomic instruction, and this is the reason why two registers are needed for creating a new stack frame in a safe way (safe means that the frame pointer always points to a valid stack frame and the stack pointer always points to the top of stack).

    • rwmj 14 hours ago

      The real benefit is being able to turn on profiling when a problem is spotted, or in some cases to be able to profile continuously in production (as apparently they do at Netflix).

      • thegeomaster 13 hours ago

        I get it. This frustrated me to no end. But still I did what I had to do --- recompiled random software throughout the stack, enabled random flags, etc. It was doable and now I can do it much faster. I don't think it's fair for upstream to disable a useful optimization just so I don't have to do this additional work to fix and optimize my system.

        • rwmj 13 hours ago

          Doing real world, whole system profiling, we've found performance was affected by completely unexpected software running on the system. Recompiling the entire distribution, or even the subset of all software installed, is not realistic for most people. Besides, I have measured the overhead of frame pointers and it's less than 1%, so there's not really any trade-off here.

          Anyway, soon we'll have SFrame support in the userspace tools and the whole issue will go away.

          • thegeomaster 11 hours ago

            In one of my jobs, a 1% perf regression (on a more stable/reproducible system, not PCs) was a reason for a customer raising a ticket, and we'd have to look into it. For dynamically dispatched but short functions, the overhead is easily more than 1% too. So, there is a trade-off, just not one that affects you.

            • Dylan16807 17 minutes ago

              If 1% shows up out of nowhere, it's very much worth investigation and trying to fix it. You shouldn't let them freely happen and pile up.

              But there are some 1% costs that are worth it.

        • Brian_K_White 13 hours ago

          I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to be doing this profiling all the time on just all binaries such that they just need to be already built this way without having to identify them and prepare them ahead of time?

          If so, then it should be the default.

          If it's a close call, then there should be 2 versions of the iso and repos.

          As many developers and service operators as there are, as much as everyone on this page is including both you and I, I still do not believe the profiling use case is the majority use case.

          The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?

          So it's not just unsympathetic "F developers/services problems". I are one myself.

          • recursivecaveat 12 hours ago

            Everyone benefits from the net performance wins that come from an ecosystem where everyone can easily profile things. I have no doubt that works out to more than a 1% lifetime improvement. Same reason you log stuff on your servers. 99.9% pure overhead, never even seen by a human. Slows stuff down, even causes uptime issues sometimes from bugs or full disks. It's still worthwhile though because occasionally it makes fixes or enhancements possible that are so much larger than the cost of the observability.

          • nemetroid 13 hours ago

            Do 50% of users need to be able to:

            * modify system services?

            * run a compiler?

            * add custom package repositories?

            * change the default shell?

            I believe the answer to all of the above is "no".

            • redox99 13 hours ago

              All those things are free in terms of performance though.

            • Brian_K_White 13 hours ago

              I don't see how this applies. Some shell has to be the default one, and all systems don't pick the same one even. Most systems don't install a compiler by default. Thank you for making my point?

              • nemetroid 13 hours ago

                All these things are possible to do, even though only developers need them. Why shouldn’t the same be true for useful profiling abilities? Because of the 1-2% penalty?

                • Brian_K_White 12 hours ago

                  Are you serious?

                  Visa makes billions per year off of nothing but collecting a mere 2%-3% tax on everything else.

                  • oasisaimlessly 11 hours ago

                    I don't see how Visa is in any way relevant here.

                    • Brian_K_White 10 hours ago

                      I don't see why not.

                      The whole point of an analogy is to expose a blind spot by showing the same thing in some other context where it is recognized or perceived differently.

      • loeg 11 hours ago

        Meta also continuously profiles in production, FWIW.

      • Brian_K_White 13 hours ago

        Then Netflix can enable it for their systems? Are they actually still profiling cat and ls that come from the os or are they profiling their own applications and the interpreters and daemons they run on?

        This does not explain why a distribution should have such a feature on by default. It only explains why Netflix wants it on some of their systems.

        • soraminazuki 13 hours ago

          People across the industry are suffering from incomplete stacktraces because their applications call into libraries like glibc or OpenSSL that have the frame pointer omission optimization enabled by their distro. It's pretty ridiculous to have to pull off a Linux from Scratch on CentOS just to get a decent stacktrace. Needless to say, this has nothing at all to do with profiling cat and ls.

          • pkhuong 13 hours ago

            OpenSSL is the worst because some configurations execute asm generated by a specialised program. That code clobbers the frame pointer (gotta go fast!) but isn't annotated with DWARF unwinding info (what do you mean you want to know what led to your app crashing in OpenSSL?)...

        • the_mitsuhiko 13 hours ago

          > Then Netflix can enable it for their systems?

          And they did.

          The question is though why only Netflix should benefit from that. It takes a lot of effort to recompile an entire Linux distribution.

        • Brian_K_White 13 hours ago

          Quoting my other comment in this thread:

          ---

          I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to be doing this profiling all the time on just all binaries such that they just need to be already built this way without having to identify them and prepare them ahead of time?

          If so, then it should be the default.

          If it's a close call, then there should be 2 versions of the iso and repos.

          As many developers and service operators as there are, as much as everyone on this page is including both you and I, I still do not believe the profiling use case is the majority use case.

          The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?

          So it's not just unsympathetic "F developers/services problems". I are one myself.

          ---

          "people across the industry" is a meaningless and valueless term and is an empty argument.

          • Dylan16807 12 minutes ago

            > such that they just need to be already built this way without having to identify them and prepare them ahead of time

            I think a big enough fraction of potentially-useful crash reports come from those systems to make it a good default.

    • brenns10 11 hours ago

      > I don't understand why frame pointers need to be in by default instead of developers enabling where needed

      If you enable frame pointers, you need to recompile every library your executable depends on. Otherwise, the unwind will fail at the first function that's not part of your executable. Usually library function calls (like glibc) are at the top of the stack, so for a large portion of the samples in a typical profile, you won't get any stack unwind at all.

      In many (most?) cases recompiling all those libraries is just infeasible for the application developers, which is why the distro would need to do it. Developers can still choose whether to include frame pointers in their own applications (and so they can still pick up those 1-2% performance gains in their own code). But they're stuck with frame pointers enabled on all the distro provided code.

      So the choice developers get to make is more along the lines of: should they use a distro with FP or without. Which is definitely not ideal, but that's life.
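
      A rough sketch of why the walk breaks, written in Python as a toy model rather than real unwinder code. The addresses, the `unwind` helper, and the dictionary standing in for stack memory are all made up for illustration, and it models only one failure mode: a function built without frame pointers that is using rbp as a scratch register when the sample is taken.

```python
WORD = 8  # bytes per stack slot on x86-64


def unwind(memory: dict[int, int], fp: int, max_depth: int = 64) -> list[int]:
    """Follow the saved-frame-pointer chain and collect return addresses."""
    return_addresses = []
    for _ in range(max_depth):
        # Stop when fp does not point at a readable saved-fp/return-address pair.
        if fp not in memory or fp + WORD not in memory:
            break
        return_addresses.append(memory[fp + WORD])
        next_fp = memory[fp]
        # The stack grows down, so each caller's frame must sit at a higher address.
        if next_fp <= fp:
            break
        fp = next_fp
    return return_addresses


# Hypothetical stack: main -> parse -> leaf, all built with frame pointers.
stack = {
    0x7000: 0x7020, 0x7008: 0x401200,  # leaf's frame: saved fp, return into parse
    0x7020: 0x7040, 0x7028: 0x401100,  # parse's frame: saved fp, return into main
    0x7040: 0x0,    0x7048: 0x401000,  # main's frame: saved fp of 0 ends the chain
}
full = unwind(stack, 0x7000)

# Same stack, but the sampled rbp is junk because the innermost function
# (assumed to live in a library built with -fomit-frame-pointer) reused it.
truncated = unwind(stack, 0xDEADBEEF)
```

      In this model `full` contains all three return addresses, while `truncated` is empty: once the register no longer points at a valid frame, nothing above it on the stack can be recovered by following the chain.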

    • ithkuil 14 hours ago

      It's useful to be able to profile on production workloads

  • fooblaster 13 hours ago

    I have always had issues with the perf call trace sampling with frame pointers, even when virtually everything in userspace is compiled with -fno-omit-frame-pointer. It doesn't look like any of the failure modes listed in the article to me though. Shrug.

    FYI, if you happen to be running on an Intel CPU, --call-graph lbr uses some specialized hardware and often delivers a far superior result, with some notable failure modes. Really looking forward to when AMD implements a similar feature.

    • rwmj 13 hours ago

      The problem with Intel LBR (last branch records) is that the depth of the call stack is relatively limited. It depends on the generation of CPU, but LWN has a table here: https://lwn.net/Articles/680985/ Anything less than 32 is fairly useless for profiling from the kernel through to userspace.
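
      To illustrate the depth limit, here is a toy model only: it treats the hardware records as a fixed-size buffer that drops its oldest entry when full. The `record_calls` helper and the frame names are hypothetical, and real LBR call-stack mode also pops entries on return, which this sketch ignores.

```python
from collections import deque


def record_calls(call_chain: list[str], depth: int) -> list[str]:
    """Keep only the most recent `depth` call records, oldest first."""
    records = deque(maxlen=depth)
    for frame in call_chain:
        records.append(frame)  # once full, the oldest record is silently dropped
    return list(records)


# A 40-deep call chain; frame_0 is the outermost caller.
chain = [f"frame_{i}" for i in range(40)]
visible = record_calls(chain, 32)
print(visible[0], len(visible))
```

      With a depth of 32, the eight outermost callers are gone, so the trace starts at `frame_8` and can no longer be tied back to its root.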

      • fooblaster 13 hours ago

        Yeah, right. I still seem to get far more "comprehensible" traces when using it, even with this limitation. It's often really easy to localize where a trace is coming from, even when truncated. It probably breaks flamegraphs though.

    • Sesse__ 11 hours ago

      I've tried --call-graph lbr a bunch of times, but often, it… just returns junk? I don't fully understand why, it sometimes returns wild pointers even if you don't have deep stacks.

      • fooblaster 11 hours ago

        I often get junk when sampling without lbr. Which kernel are you running? The quality of perf and the associated perf_events varies wildly across kernel versions.

        • Sesse__ 11 hours ago

          A variety of kernels over the last five years, on a multitude of Intel CPUs. :-) I last tested this on 6.10, I think.

          It's certainly true that there can be junk in --call-graph fp, too.

  • clausecker 13 hours ago

    The “function prologue is at least 8 bytes long” bit only applies if CET is used. If it is not used, the endbr64 instruction is not emitted and the prologue is only 4 bytes long.
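
    The byte counts can be checked against the objdump listing quoted elsewhere in the thread. This is a small sketch: the encodings are copied from that listing, and the `prologue_length` helper is made up for illustration.

```python
# Instruction encodings as shown in the objdump output quoted in the thread.
PROLOGUE = [
    ("endbr64", bytes.fromhex("f30f1efa")),     # CET landing pad
    ("push %rbp", bytes.fromhex("55")),         # save caller's frame pointer
    ("mov %rsp,%rbp", bytes.fromhex("4889e5")), # establish the new frame
]


def prologue_length(with_cet: bool) -> int:
    """Total prologue size in bytes, with or without the CET landing pad."""
    return sum(
        len(encoding)
        for name, encoding in PROLOGUE
        if with_cet or name != "endbr64"
    )


print(prologue_length(with_cet=True), prologue_length(with_cet=False))
```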

  • laserbeam 13 hours ago

    "enabling frame pointers is a 1-2% performance loss, which translates to the loss of about 1 or 2 years of compiler improvements"

    Wait, are we really that close to the maximum of what a compiler can optimize that we're getting barely 1% performance improvements per year with new versions?

    • thechao 13 hours ago

      As a part time compiler author I'm extremely skeptical we're getting a global 1–2%/yr. I'd've thought more like a tenth to half that? I've not seen any numbers, so I'm just making shit up.

      However, for sure, if compiler optimizations disappeared, HW would pick up the slack in a few years.

    • variadix 11 hours ago

      There’s likely a lot of performance still on the table if compilers were permitted to change data structure layout, but I think doing this effectively is an open problem.

      Current compilers could do a lot better with vectorization, but it will often be limited by the data structure layout.

    • fanf2 13 hours ago

      Proebsting’s Law suggests 4% per year, but as a satirical joke it seems to have underdone its cynicism.

      https://gwern.net/doc/cs/algorithm/2001-scott.pdf

    • clausecker 13 hours ago

      Yeah, compilers are already pretty close to the limit of what is possible, unless your code is unusually poorly written.

      • londons_explore 12 hours ago

        Clearly this isn't the case. Plenty of neat C++ "reference implementation" code ends up 5x faster when hand optimized, parallelized, vectorized, etc.

        There are some transformations that compilers are really bad at. Rearranging data structures, switching out algorithms for equivalent ones with better big-O complexity, generating & using lookup tables, bit-packing things, using caches, hash tables and bloom filters for time/memory trade offs, etc.

        The spec doesn't prevent such optimizations, but current compilers aren't smart enough to find them.

        • adrianN 12 hours ago

          Imagine the outcry if compilers switched algorithms. How can the compiler know my input size and input distribution? Maybe my dumb algorithm is optimal for my data.

          • londons_explore 11 hours ago

            Compilers can easily runtime-detect the size and shape of the problem, and run different code for different problem sizes. Many already do for loop unrolling. For example, if you memcpy 2 bytes, they won't even branch into the fancy SIMD version.

            This would just be an extension of that. If the code creates and uses a linked list, yet the list is 1M items long and being accessed entirely by index, branch to a different version of the code which uses an array, etc.
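
            A hand-written sketch of that kind of size-based dispatch. The threshold and function names are hypothetical, and this shows the pattern a programmer would write by hand today, not something any current compiler is known to generate.

```python
SMALL_INPUT_THRESHOLD = 16  # hypothetical cutoff; a real one would be measured


def insertion_sort(items):
    """Simple quadratic sort that tends to do well on very short inputs."""
    result = list(items)
    for i in range(1, len(result)):
        value = result[i]
        j = i - 1
        # Shift larger elements right until value's slot is found.
        while j >= 0 and result[j] > value:
            result[j + 1] = result[j]
            j -= 1
        result[j + 1] = value
    return result


def dispatching_sort(items):
    """Pick an implementation at runtime based on the size of the input."""
    if len(items) <= SMALL_INPUT_THRESHOLD:
        return insertion_sort(items)
    return sorted(items)  # general-purpose path for larger inputs
```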

        • laserbeam 12 hours ago

          That's my question. I'm also under the impression that optimizations CAN be made manually, but I find it surprising that "current compilers aren't smart enough to find them" isn't improving.

          • londons_explore 10 hours ago

            The percentage of all software engineers working on compilers is probably lower now than it ever has been...

  • tempfile 12 hours ago

    The JS constantly grabbing the anchor and updating it is absolutely appalling UX. It took me something like 11 back button presses to get back to where I was. Borderline malware.

  • jeffbee 13 hours ago

    Complaining about frame pointers is like complaining about the budget of the Bureau of Labor Statistics. Yes, it's pure overhead, but also yes, it's good to know what is going on.

  • dap 15 hours ago

    This reads to me like FUD. Isn’t the fraction of profile samples in a prologue heavily workload dependent? And whichever way you go on frame pointers, there are winners and losers to including them by default.

    • thegeomaster 13 hours ago

      There are no performance winners if you include them by default. There will be an additional >0% overhead from executing extra code in the prologue and epilogue, and from increased register pressure because rbp can never be allocated as a general-purpose register.

      There are only "winners" in the sense that people will be able to more easily see why their never-tuned system is so slow. On the other hand, you're punishing all perf-critical usecases with unnecessary overhead.

      I believe if you have a slow system, it's up to you to profile and optimize it, and that includes even recompiling some software with different flags to enable profiling. It's not the job of upstream to make this easier for you if it means punishing those workloads where teams have diligently profiled and optimized through the years so that there is no, as the author says, low-hanging fruit to find.

      • dap 12 hours ago

        I’ve been around long enough to have had frame pointers pretty ubiquitously, then lost them, and now to be starting to have them again. The dark times in the middle were painful. For the software I’ve worked on, the easy dynamic profiling using frame pointers (e.g. using DTrace) has given far more in performance wins than omitting them would have. (Part of my beef with the article is that while edge cases do break some samples, in practice it’s a very small fraction, and almost by definition not the important ones if you’re trying to find heavy on-CPU code paths.)

        I get that some use cases may be better without frame pointers. A well-resourced team can always recompile the world, whichever the default is. It’s just that my experience is that most software is not already perfectly tuned and I’d much rather the default be more easily observable.

        • thegeomaster 11 hours ago

          Look, it's likely we just come from different backgrounds. Most of my perf-sensitive work was optimizing inner loops with SIMD, allowing the compiler to inline hot functions, creating better data structures to make use of the CPU cache, etc. Frame pointer prologue overhead was measurable on most of our use-cases. I have less experience profiling systems where call traces span multiple processes, so maybe I haven't felt this pain enough. Though I still think the onus should be on teams to be able to comfortably recompile---not the world---but some part of it. After all, a lot of tuning can only be done through compile flags, such as turning off codepaths/capabilities which are unnecessary.

          • audidude 9 hours ago

            I think your viewpoint is valid.

            My experience is in performance tuning on the other side you mention. Cross-application, cross-library, whole-system, daemons, etc. Basically, "the whole OS as it's shipped to users".

            For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly. The rest of the system is negligible and probably not even used. Who would optimize SIMD routines next to function calls anyway?

          • dap 10 hours ago

            Makes sense.

            I wasn't exaggerating about recompiling the world, though. Even if we say I'm only interested in profiling my application, a single library compiled without frame pointers makes useless any samples where code in that library was at the top of the stack. I've seen that be libc, openssl, some random Node module or JNI thing, etc. You can't just throw out those samples because they might still be your application's problem. For me in those situations, I would have needed to recompile most of the packages we got from both the OS distro and the supplemental package repo.
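
            A toy model in C of why one frame without a frame pointer loses the rest of the sample. It assumes the usual x86-64 layout where a kept frame pointer points at a record holding the caller's frame pointer followed by the return address. `struct frame_record`, `walk_frames`, and the stack-bounds check are illustrative names and choices here, not the code of any real unwinder.

            ```c
            #include <stddef.h>
            #include <stdint.h>

            /* Assumed layout when frame pointers are kept:
               [fp] = caller's frame pointer, [fp + 8] = return address. */
            struct frame_record {
                struct frame_record *caller_fp;
                uintptr_t return_addr;
            };

            /* Follows the chain starting at fp and stores return addresses in out.
               Stops at NULL, when out is full, or when fp does not point at a whole
               record inside [stack_lo, stack_hi), which is what happens once some
               function has used the frame-pointer register for ordinary data.
               Returns the number of frames recovered. */
            size_t walk_frames(const struct frame_record *fp,
                               const void *stack_lo, const void *stack_hi,
                               uintptr_t *out, size_t max)
            {
                uintptr_t lo = (uintptr_t)stack_lo;
                uintptr_t hi = (uintptr_t)stack_hi;
                size_t n = 0;

                while (fp != NULL && n < max) {
                    uintptr_t a = (uintptr_t)fp;
                    if (a < lo || a >= hi || hi - a < sizeof *fp)
                        break; /* not a plausible frame record: chain is broken */
                    out[n++] = fp->return_addr;
                    fp = fp->caller_fp;
                }
                return n;
            }
            ```

            With an intact chain the walk reaches the outermost caller. Once one link holds a non-pointer value, everything above it is gone, even though those callers may be exactly the application code being profiled.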