Didn’t really get the point of the post as it just presents something without a conclusion.
9X% of users do not care about a <1% drop in performance. I suspect we get the same variability just by going from one kernel version to another. The impact from all the Intel mitigations that are now enabled by default is much worse.
However I do care about nice profiles and stack traces without having to jump through hoops.
Asking people to recompile an _entire_ distribution just to get sane defaults is wrong. Those who care about the last drop should build their custom systems as they see fit, and they probably already do.
> 9X% of users do not care about a <1%
The same could be said about any accessibility issue or minority-language translations.
This comparison is pretty misleading. An accessibility issue prevents someone from being able to use software effectively. Not having localized text would have a similar impact. A ~1% performance impact on the other hand is the minuscule downside of improving debugging, profiling and error reporting for an entire OS. And that's not just a minority of users, as tons of software will automatically gather stack traces for bug reports.
There's basically no downside to fixing accessibility issues or adding new language translations other than the work involved in doing so. (And yes, maintaining translations over time is hard, but most projects let them lag during development, so they don't directly hold anything back.) There is a rather glaring downside to this performance optimization, whose upside is sometimes entirely within run-to-run variance and can be blown away by almost any other performance tweak. It's clear the optimization has some upsides, but an extra register and saving some trivial loads/stores just isn't as big of a deal on modern processors that are loaded to the gills with huge caches and deep pipelines.
I guess I don't care that much about fomit-frame-pointer in the grand scheme of things, but I think enabling it in distributions was ultimately a mistake. If some software packages benefited enough from it, it could've just been done only for those packages. Doing it across the system is questionable at best...
it does present a conclusion. once the kernel supports .sframe it will be all-around superior to -fomit-frame-pointer, and a better default for distros to use.
It does cause more memory pressure because the kernel will have to look at user-space memory to decode registers.
So yes, it will be faster than the other alternatives to frame pointers, but it still won't be as fast as frame pointers.
But does what you care about matter enough to be the default?
Are you the majority?
Evaluate "majority" this way: For every/any random binary in a distro, out of all the currently running instances of that binary in the world at any given moment, how many of those need to be profiled?
There is no way the answer is "most of them".
You have a job where you profile things, and maybe even you profile almost everything you touch. Your whole world has a high quotient of profiling in it. So you want the whole system built for profiling by default. How convenient for you. But your whole world is not the whole world.
But it's not just you, there are, zomg thousands, tens of thousands, maybe even hundreds of thousands of developers and ops admins the same as you.
Yes and? Is even that most installed instances of any given executable? No way.
Or maybe yes. It's possible. Can you show that somehow? But I will guess no way and not even close.
> Evaluate "majority" this way: For every/any random binary in a distro, out of all the currently running instances of that binary in the world at any given moment, how many of those need to be profiled? > There is no way the answer is "most of them".
This is an absurd way to evaluate it. All it takes is one savvy user to report a performance problem that developers are able to root-cause using stack traces from the user's system. Suppose they're able to make a 5% performance improvement to the program. Now all users' programs are 5% faster because of the frame pointers on this one user's system.
At this point people usually ask: but couldn't developers have done that on their own systems with debug code? But the performance of debug code is not the same as the performance of shipping code. And not all problems manifest the same on all systems. This is why you need shipping code to be debuggable (or instrumentable or profileable or whatever you want to call it).
I regularly have users run Sysprof and upload it to issues. It's immensely powerful to be able to see what is going on systems which are having issues. I'd argue it's one of the major reasons GNOME performance has gotten so much better in the recent-past.
You can't do that when step one is reinstall another distro and reproduce your problem.
Additionally, the performance-sensitive code that could fall into that 1% range (hint: it's not much) rarely uses the system libraries in a way that would be affected anyway. They can compile that app with frame-pointers disabled. And for the stuff where they do use system libraries (qsort, bsearch, strlen, etc.) the frame pointer is negligible compared to the work being performed. Your margin of error is way larger than the theoretical overhead.
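As a concrete sketch of that opt-out (hedged; hot_loop.c and dot() are made-up names), a perf-sensitive project can build just its own hot translation units with -fomit-frame-pointer while the rest of the application and every distro library keep frame pointers, so system-wide profiles stay walkable:

/* hot_loop.c -- hypothetical perf-critical inner loop.  Build only this file
 * without frame pointers, e.g.  gcc -O2 -fomit-frame-pointer -c hot_loop.c,
 * and leave the distro default alone everywhere else. */
double dot(const double *a, const double *b, long n)
{
    double s = 0.0;
    for (long i = 0; i < n; i++)
        s += a[i] * b[i];   /* the loop body dwarfs a push/mov/pop prologue */
    return s;
}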
1% is a ton. 1% is crazy. Visa owns the world off just a 3% tax on everything else. Brokers make billions off of just 1% or even far less.
1% of all activity is only rational if you get more than 1% of all activity back out from those times and places where it was used.
1%, when it's of everything, is an absolutely stupendous, colossal number that is crazy to try to treat as trivial.
Better analogy: you're paying 30% to apple, and over 50% in bad payday loans, and you're worried about the 3% visa/stripe overhead ... that's kinda crazy. But that's where we are in computer performance, there's 10x, 100x, and even greater inefficiencies everywhere, 1% for better backtraces is nothing.
Absolutely. We've gotten numerous double digit performance improvements across applications, libraries, and system daemons because of frame-pointers in Fedora (and that's just from me).
Performance problems matter to the people who have them, who often are in an inconvenient place. Having the ability for profiling to just work means that it's easy to help these people.
I think you are trying to make this out to be something that it isn’t.
Visibility at the “cost” of negligible impact is more important than raw performance. That’s it.
I’m a regular user of Linux with some performance sensitivity that does not go as far as “I _need_ that extra register!”. That’s what the majority of developers working on Linux are like. I think it’s up to _you_ to prove the contrary.
This seems like a ridiculous attempt to bury your head in the sand. Is there any evidence anyone doesn't want frame pointers?
I think it's ridiculous to question that since obviously, yes, many people have decided exactly that. I see no point myself and I'm even in the field. And I am not in charge of all the distributions which disabled it by default.
So, "yes". In fact "yes, duh?" Talk about head in sand...
Ok, where's the evidence?
> I see no point myself and I'm even in the field.
You don't see the point of readable stack traces?
Nope. Not on 99.999% of installed binaries in existence and running at a given moment.
That strikes me as an insane take (not to mention blatantly inaccurate), but I take your point that this is a common one for distribution-maintainers to have.
> 9X% of users do not care about a <1% drop in performance.
Except Python got opted out of the frame pointer change due to benchmarks showing slowdowns of up to 10%. The discussion around that had the great idea of just adding a pragma to flat out override the build setting. So in the end that "1%" reduction claim only holds if everything even remotely affected silently ignores the flag.
This is a bit of a mischaracterization of the Python side of things.
They only opted out for 3.11 which did not yet have the perf-integration fixes anyway. 3.12 uses frame-pointers just fine.
Any link to the fix or documentation about it? I could find added perf support but did not see anything about improved performance related to frame pointer use.
fifty independent 1% performance drops nobody cares about compound to a ~40% reduction.
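For the record, the arithmetic there checks out: fifty independent 1% slowdowns multiply rather than add, so throughput ends up at 0.99^50 ≈ 0.605 of the original, i.e. a reduction of roughly 40% rather than 50%.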
I added support to Sysprof this weekend for unwinding using libdwfl and DWARF/CFI/eh_frame/etc techniques that Serhei did in eu-stacktrace.
The overhead is about 10% of samples. But at least you can unwind on systems without frame-pointers. Personally I'll take the statistical anomalies of frame-pointers which still allow you to know what PID/TID are your cost center even if you don't get perfect unwinds. Everyone seems motivated towards SFrame going forward, which is good.
https://blogs.gnome.org/chergert/2024/11/03/profiling-w-o-fr...
I broadly agree with the thesis of the post, which if I understand correctly is that frame pointers are a temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers, and it seems that we'll eventually get there.
Some of the statements in the post seem odd to me though.
- 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
- Is using the whole 8 bytes right for the estimate? Pushing the stack pointer is the first instruction in the prologue and it's literally 1 byte. Epilogue is symmetrical.
- Even if we're in the prologue, we know that we're in a leaf call, we can still resolve the instruction pointer to the function, and we can read the return address to find the parent, so what information is lost?
When it comes to future alternatives, while frame pointers have their own problems, I think that there are still a few open questions:
- Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
- Is the memory overhead of lookup tables for very large programs acceptable?
> - Is using the whole 8 bytes right for the estimate? Pushing the stack pointer is the first instruction in the prologue and it's literally 1 byte. Epilogue is symmetrical.
I believe it's because of the landing pad for Control Flow Integrity which basically all functions now need. Grabbing main() from a random program on Fedora (which uses frame pointers):

0000000000007000 <main>:
    7000: f3 0f 1e fa    endbr64         ; landing pad
    7004: 55             push %rbp       ; set up frame pointer
    7005: 48 89 e5       mov %rsp,%rbp

It's not much of an issue in practice as the stack trace will still be nearly correct, enough for you to identify the problematic area of the code.

> - Shadow stacks are cool but aren't they limited to a fixed number of entries? What if you have a deeper stack?
Yes shadow stacks are limited to 32 entries on the most recent Intel CPUs (and as little as 4 entries on very old ones). However they are basically cost free so that's a big advantage.
I think SFrame is a sensible middle ground here. It's saner than DWARF and has a long history of use in the kernel so we know it will work.
If you're limited to 32 entries, why not just use LBR, then? It has basically the same pros and cons.
> - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI, so every time I profile stuff I get depressed… Aarch64 seems to be better, the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)
This sounds to me more like an inlining problem than an ABI problem. If the calls take as much time as the running code, perhaps you just need a better language that doesn’t arbitrarily prevent inlining due to compilation boundaries (e.g. basically any modern language that isn’t in the C/C++ family, before LTO).
I see this in LTO/PGO binaries as well. If a function is 20 instructions long, it's not like you can inline it uncritically, yet a five-cycle prologue and a five-cycle epilogue will hurt. (Also, recursive functions etc.)
> Shadow stacks are cool but aren't they limited to a fixed number of entries?
Current available hardware yes. But I think some of the future Intel stuff was going to allow for much larger depth.
> Is the memory overhead of lookup tables for very large programs acceptable?
I don't think SFrame is as "dense" as DWARF as a format so you trade a bit of memory size for a much faster unwind experience. But you are definitely right that this adds memory pressure that could otherwise be ignored.
Especially if the anomalies are what they sound like, just account for them statistically. You get a PID for cost accounting in the perf_event frame anyway.
> temporary compromise until the whole ecosystem gets its act together and manages to agree on some form of out-of-band tracking of frame pointers,
Temporary solutions have a way of becoming permanent. I was against the recent frame pointer enablement on the grounds of moral hazard. I still think it would have been better to force the ecosystem to get its act together first.
Another factor nobody is talking about is JITed and interpreted languages. Whatever the long-term solution might be, it should enable stack traces that interleave accurate source-level frame information from native and managed code. The existing perf /tmp file hack is inadequate in many ways, including security, performance, and compatibility with multiple language runtimes coexisting in a single process.
It's a disaster no doubt.
But, at least from the GNOME side of things, we've been complaining about it for roughly 15 years and kept getting push-back in the form of "we'll make something better".
Now that we have frame-pointers enabled in Fedora, Ubuntu, Arch, etc we're starting to see movement on realistic alternatives. So in many ways, I think the moral hazard was waiting until 2023 to enable them.
Is this a response to Alma Kitten?
In any event I don't understand why frame pointers need to be in by default instead of developers enabling where needed.
Having Kitten include frame pointers by default seems reasonable enough, since Kitten is a devel system.
In reality, all this discussion has its origin in a design mistake Intel made in the 8086 CPU when it was launched in 1978.
They designed the instruction set in such a way that two distinct registers are necessary to fulfill the roles of stack pointer and frame pointer.
In better-designed instruction sets, for example IBM POWER, a single register is enough to fulfill both roles, being simultaneously the stack pointer and the frame pointer.
Unfortunately, the Intel designers did not think about this problem at all; in 1978 they simply followed the example of the architectures popular at the time, e.g. the DEC VAX, which had made the same mistake of reserving two distinct registers for the roles of stack pointer and frame pointer.
In the architectures where a single register plays both roles, the stack pointer always points to a valid stack frame that is a part of a linked list of all stack frames. For this to work, there must be an atomic instruction for both creating a new stack frame (which consists in storing the old frame pointer in the right place of the new stack frame) and updating the stack pointer to point to the new stack frame. The Intel/AMD ISA does not have such an atomic instruction, and this is the reason why two registers are needed for creating a new stack frame in a safe way (safe means that the frame pointer always points to a valid stack frame and the stack pointer always points to the top of stack).
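To make that linked-list structure concrete, here is a minimal, hedged x86-64 sketch (it assumes the whole call chain was built with -fno-omit-frame-pointer and that the outermost frame's saved %rbp is zero, as the psABI recommends for process startup). Each frame begins with the caller's saved %rbp and the return address sits in the next slot up, so a profiler needs only two loads per frame:

#include <stdio.h>
#include <stdint.h>

/* Walk the current thread's frame-pointer chain.  The saved frame pointers
 * form a singly linked list terminated by a zero %rbp. */
static void walk_frames(void)
{
    uintptr_t *fp;
    __asm__ volatile ("mov %%rbp, %0" : "=r" (fp));   /* this frame's %rbp */
    while (fp) {
        uintptr_t ret = fp[1];                        /* caller's return address */
        printf("frame %p  return address %#lx\n", (void *) fp, (unsigned long) ret);
        fp = (uintptr_t *) fp[0];                     /* follow the saved %rbp */
    }
}

int main(void)
{
    walk_frames();
    return 0;
}

Compiled with -fno-omit-frame-pointer this prints one line per caller; the moment any frame in the chain was built without the frame pointer, fp[0] holds arbitrary data and the walk ends or derails, which is exactly the partial-stack problem discussed elsewhere in this thread.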
The real benefit is being able to turn on profiling when a problem is spotted, or in some cases to be able to profile continuously in production (as apparently they do at Netflix).
I get it. This frustrated me to no end. But still I did what I had to do --- recompiled random software throughout the stack, enabled random flags, etc. It was doable and now I can do it much faster. I don't think it's fair for upstream to disable a useful optimization just so I don't have to do this additional work to fix and optimize my system.
Doing real-world, whole-system profiling, we've found performance was affected by completely unexpected software running on the system. Recompiling the entire distribution, or even the subset of all software installed, is not realistic for most people. Besides, I have measured the overhead of frame pointers and it's less than 1%, so there's not really any trade-off here.
Anyway, soon we'll have SFrame support in the userspace tools and the whole issue will go away.
In one of my jobs, a 1% perf regression (on a more stable/reproducible system, not PCs) was a reason for a customer raising a ticket, and we'd have to look into it. For dynamically dispatched but short functions, the overhead is easily more than 1% too. So, there is a trade-off, just not one that affects you.
If 1% shows up out of nowhere, it's very much worth investigation and trying to fix it. You shouldn't let them freely happen and pile up.
But there are some 1% costs that are worth it.
I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to be doing this profiling all the time on just all binaries such that they just need to be already built this way without having to identify them and prepare them ahead of time?
If so, then it should be the default.
If it's a close call, then there should be 2 versions of the iso and repos.
As many developers and service operators as there are, as much as everyone on this page is including both you and I, I still do not believe the profiling use case is the majority use case.
The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?
So it's not just unsympathetic "F developers/services problems". I are one myself.
Everyone benefits from the net performance wins that come from an ecosystem where everyone can easily profile things. I have no doubt that works out to more than a 1% lifetime improvement. Same reason you log stuff on your servers. 99.9% pure overhead, never even seen by a human. Slows stuff down, even causes uptime issues sometimes from bugs or full discs. It's still worthwhile though because occasionally it makes fixes or enhancements possible that are so much larger than the cost of the observability.
Do 50% of users need to be able to:
* modify system services?
* run a compiler?
* add custom package repositories?
* change the default shell?
I believe the answer to all of the above is "no".
All those things are free in terms of performance though.
I don't see how this applies. Some shell has to be the default one, and all systems don't pick the same one even. Most systems don't install a compiler by default. Thank you for making my point?
All these things are possible to do, even though only developers need them. Why shouldn’t the same be true for useful profiling abilities? Because of the 1-2% penalty?
Are you serious?
Visa makes billions per year off of nothing but collecting a mere 2%-3% tax on everything else.
I don't see how Visa is in any way relevant here.
I don't see why not.
The whole point of an analogy is to expose a blind spot by showing the same thing in some other context where it is recognized or perceived differently.
Meta also continuously profiles in production, FWIW.
Then Netflix can enable it for their systems? Are they actually still profiling cat and ls that come from the os or are they profiling their own applications and the interpreters and daemons they run on?
This does not explain why a distribution should have such a feature on by default. It only explains why Netflix wants it on some of their systems.
People across the industry are suffering from incomplete stack traces because their applications call into libraries like glibc or OpenSSL that have the frame-pointer omission optimization enabled by their distro. It's pretty ridiculous to have to pull off a Linux From Scratch on CentOS just to get a decent stack trace. Needless to say, this has nothing at all to do with profiling cat and ls.
OpenSSL is the worst because some configurations execute asm generated by a specialised program. That code clobbers the frame pointer (gotta go fast!) but isn't annotated with DWARF unwinding info (what do you mean you want to know what led to your app crashing in OpenSSL?)...
> Then Netflix can enable it for their systems?
And they did.
The question is though why only Netflix should benefit from that. It takes a lot of effort to recompile an entire Linux distribution.
Quoting my other comment in this thread:
---
I think it comes down to numbers. What are most installed systems used for? Do more than 50% of installed systems need to be doing this profiling all the time on just all binaries such that they just need to be already built this way without having to identify them and prepare them ahead of time?
If so, then it should be the default.
If it's a close call, then there should be 2 versions of the iso and repos.
As many developers and service operators as there are, as much as everyone on this page is including both you and I, I still do not believe the profiling use case is the majority use case.
The way I am trying to judge "majority" is: Pick a binary at random from a distribution. Now imagine all running instances of that binary everywhere. How many of those instances need to be profiled? Is it really most of them?
So it's not just unsympathetic "F developers/services problems". I are one myself.
---
"people across the industry" is a meaningless and valueless term and is an empty argument.
> such that they just need to be already built this way without having to identify them and prepare them ahead of time
I think a big enough fraction of potentially-useful crash reports come from those systems to make it a good default.
> I don't understand why frame pointers need to be in by default instead of developers enabling where needed
If you enable frame pointers, you need to recompile every library your executable depends on. Otherwise, the unwind will fail at the first function that's not part of your executable. Usually library function calls (like glibc) are at the top of the stack, so for a large portion of the samples in a typical profile, you won't get any stack unwind at all.
In many (most?) cases recompiling all those libraries is just infeasible for the application developers, which is why the distro would need to do it. Developers can still choose whether to include frame pointers in their own applications (and so they can still pick up those 1-2% performance gains in their own code). But they're stuck with frame pointers enabled on all the distro provided code.
So the choice developers get to make is more along the lines of: should they use a distro with FP or without. Which is definitely not ideal, but that's life.
It's useful to be able to profile production workloads.
I have always had issues with the perf call trace sampling with frame pointers, even when virtually everything in userspace compiled with fno-omit-frame-pointer. It doesn't look like any of the failure modes listed in the article to me though. Shrug.
FYI, if you happen to be running on an Intel CPU, --call-graph lbr uses some specialized hardware and often delivers a far superior result, with some notable failure modes. Really looking forward to when AMD implements a similar feature.
The problem with Intel LBR (last branch records) is that the depth of the call stack is relatively limited. It depends on the generation of CPU, but LWN has a table here: https://lwn.net/Articles/680985/ Anything less than 32 is fairly useless for profiling from the kernel through to userspace.
Yeah, right. I still seem to get far more "comprehensible" traces when using it, even with this limitation. It's often really easy to localize where a trace is coming from, even when truncated. It probably breaks flamegraphs though.
I've tried --call-graph lbr a bunch of times, but often, it… just returns junk? I don't fully understand why, it sometimes returns wild pointers even if you don't have deep stacks.
I often get junk when sampling without lbr. Which kernel are you running? The quality of perf and the associated perf_events varies wildly across kernel versions.
A variety of kernels over the last five years, on a multitude of Intel CPUs. :-) I last tested this on 6.10, I think.
It's certainly true that there can be junk in --call-graph fp, too.
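For anyone who wants to compare the approaches discussed above on the same workload, perf exposes all three unwinding strategies; a quick hedged sketch of the invocations (the binary name is a placeholder):

perf record --call-graph fp    ./myapp   # walk frame pointers; needs -fno-omit-frame-pointer throughout
perf record --call-graph dwarf ./myapp   # copy a chunk of user stack per sample, unwind later via eh_frame
perf record --call-graph lbr   ./myapp   # Intel last-branch-record hardware, limited call depth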
The “function prologue is at least 8 bytes long” bit only applies if CET is used. If it is not used, the endbr64 instruction is not emitted and the prologue is only 4 bytes long.
"enabling frame pointers is a 1-2% performance loss, which translates to the loss of about 1 or 2 years of compiler improvements"
Wait, are we really that close to the maximum of what a compiler can optimize that we're getting barely 1% performance improvements per year with new versions?
As a part time compiler author I'm extremely skeptical we're getting a global 1–2%/yr. I'd've thought more like a tenth to half that? I've not seen any numbers, so I'm just making shit up.
However, for sure, if compiler optimizations disappeared, HW would pick up the slack in a few years.
There’s likely a lot of performance still on the table if compilers were permitted to change data structure layout, but I think doing this effectively is an open problem.
Current compilers could do a lot better with vectorization, but it will often be limited by the data structure layout.
Proebsting’s Law suggests 4% per year, but as a satirical joke it seems to have underdone its cynicism.
https://gwern.net/doc/cs/algorithm/2001-scott.pdf
Yeah, compilers are already pretty close to the limit of what is possible, unless your code is unusually poorly written.
Clearly this isn't the case. Plenty of neat C++ "reference implementation" code ends up 5x faster when hand optimized, parallelized, vectorized, etc.
There are some transformations that compilers are really bad at. Rearranging data structures, switching out algorithms for equivalent ones with better big-O complexity, generating & using lookup tables, bit-packing things, using caches, hash tables and bloom filters for time/memory trade offs, etc.
The spec doesn't prevent such optimizations, but current compilers aren't smart enough to find them.
Imagine the outcry if compilers switched algorithms. How can the compiler know my input size and input distribution? Maybe my dumb algorithm is optimal for my data.
Compilers can easily runtime-detect the size and shape of the problem, and run different code for different problem sizes. Many already do for loop-unrolling. Ie. if you memcpy 2 bytes, they won't even branch into the fancy SIMD version.
This would just be an extension of that. If the code creates and uses a linked list, yet the list is 1M items long and being accessed entirely by index, branch to a different version of the code which uses an array, etc.
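A small hedged sketch of the kind of size dispatch described above, hand-written for illustration (compilers and libc already do the equivalent internally for memcpy and unrolled loops):

#include <stddef.h>
#include <string.h>

/* Choose an implementation at run time based on the problem size, the way
 * memcpy implementations branch to a small-copy path before the SIMD one. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    if (n <= 16) {
        unsigned char *d = dst;
        const unsigned char *s = src;
        while (n--)              /* tiny copy: plain byte loop, no SIMD setup */
            *d++ = *s++;
    } else {
        memcpy(dst, src, n);     /* large copy: the library's vectorized path */
    }
}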
That's my question. I'm also under the impression that those optimizations CAN be made manually, but I find it surprising that "current compilers aren't smart enough to find them" isn't improving.
The percentage of all software engineers working on compilers is probably lower now than it ever has been...
The JS constantly grabbing the anchor and updating it is absolutely appalling UX. It took me something like 11 back button presses to get back to where I was. Borderline malware.
Complaining about frame pointers is like complaining about the budget of the Bureau of Labor Statistics. Yes, it's pure overhead, but also yes, it's good to know what is going on.
This reads to me like FUD. Isn’t the fraction of profile samples in a prologue heavily workload dependent? And whichever way you go on frame pointers, there are winners and losers to including them by default.
There are no performance winners if you include them by default. There will be an additional >0% overhead because you are executing additional code in the prologue and epilogue and increasing register pressure by preventing rbp from ever being allocated.
There are only "winners" in the sense that people will be able to more easily see why their never-tuned system is so slow. On the other hand, you're punishing all perf-critical use cases with unnecessary overhead.
I believe if you have a slow system, it's up to you to profile and optimize it, and that includes even recompiling some software with different flags to enable profiling. It's not the job of upstream to make this easier for you if it means punishing those workloads where teams have diligently profiled and optimized through the years so that there is no, as the author says, low-hanging fruit to find.
I’ve been around long enough to have had frame pointers pretty ubiquitously, then lost them, and now starting to have them again. The dark times in the middle were painful. For the software I’ve worked on, the easy dynamic profiling using frame pointers (eg using DTrace) has given far more in performance wins than omitting them would have. (Part of my beef with the article is that while edge cases do break some samples, in practice it’s a very small fraction, and almost by definition not the important ones if you’re trying to find heavy on-CPU code paths.)
I get that some use cases may be better without frame pointers. A well-resourced team can always recompile the world, whichever the default is. It’s just that my experience is that most software is not already perfectly tuned and I’d much rather the default be more easily observable.
Look, it's likely we just come from different backgrounds. Most of my perf-sensitive work was optimizing inner loops with SIMD, allowing the compiler to inline hot functions, creating better data structures to make use of the CPU cache, etc. Frame pointer prologue overhead was measurable on most of our use-cases. I have a smaller amount of experience on profiling systems where calls trace across multiple processes, so maybe I haven't felt this pain enough. Though I still think the onus should be on teams to be able to comfortably recompile---not the world---but some part of it. After all, a lot of tuning can only be done through compile flags, such as turning off codepaths/capabilities which are unnecessary.
I think your viewpoint is valid.
My experience is on performance tuning the other side you mention. Cross-application, cross-library, whole-system, daemons, etc. Basically, "the whole OS as it's shipped to users".
For my case, I need the whole system set up correctly before it even starts to be useful. For your case, you only need the specific library or application compiled correctly. The rest of the system is negligible and probably not even used. Who would optimize SIMD routines next to function calls anyway?
Makes sense.
I wasn't exaggerating about recompiling the world, though. Even if we say I'm only interested in profiling my application, a single library compiled without frame pointers makes useless any samples where code in that library was at the top of the stack. I've seen that be libc, openssl, some random Node module or JNI thing, etc. You can't just throw out those samples because they might still be your application's problem. For me in those situations, I would have needed to recompile most of the packages we got from both the OS distro and the supplemental package repo.