Great analysis of unwinding overhead in Rust. The framing of exceptions as "alternate returns" is enlightening - they should be cheap in theory, which makes the current ~2.3μs overhead particularly interesting to dissect. The optimization journey from removing unnecessary type erasure to using thread-locals is well explained. While the 4.3x speedup is impressive, I think the bigger takeaway isn't about replacing Result with panics, but rather identifying where non-local control flow genuinely makes sense. Looking forward to the deep dive into unwinder implementations.
Well, unwinding can be as simple (and as fast) as restoring the saved stack pointer and jumping straight to the handler.
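For instance (a minimal sketch of the idea, assuming a setjmp/longjmp-style handler):

```cpp
#include <csetjmp>
#include <cstdio>

static std::jmp_buf handler;

[[noreturn]] void fail() {
    // The whole "unwind": restore the registers/stack pointer saved by
    // setjmp and jump straight back. Nothing on the skipped frames gets
    // cleaned up along the way.
    std::longjmp(handler, 1);
}

int main() {
    if (setjmp(handler)) {   // longjmp lands here, like a catch
        std::puts("recovered");
        return 0;
    }
    fail();
}
```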
but it only works if you don't need to run the defers/Drop's/destructors/etc. for stuff that's on the stack between the current frame and the handler's frame. Which you do, most of the time.
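A minimal C++ sketch of what that cleanup obligation looks like (hypothetical names, not from the article):

```cpp
#include <cstdio>
#include <stdexcept>

struct Guard {
    ~Guard() { std::puts("cleanup ran"); }  // must execute during unwinding
};

void inner() { throw std::runtime_error("boom"); }

void middle() {
    Guard g;   // lives on a frame between the throw and the handler
    inner();
}

int main() {
    try {
        middle();
    } catch (const std::exception&) {
        std::puts("handled");  // prints only after "cleanup ran"
    }
}
```

A plain longjmp from inner() past middle() would skip ~Guard entirely; in C++ that's undefined behavior precisely because the bypassed frame has a non-trivial destructor.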
> it only works if you don't need to run the defers/Drop's/destructors/etc

Indeed. And the per-frame cleanup is also language-agnostic, which adds overhead; it must also support both sjlj and dwarf frames[1]. Unwinding is also done in two phases: destructors are only run if an actual catch is found, so an unhandled exception doesn't run destructors, preserving state in the core file. That two-phase design again slows things down.
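That behavior is easy to observe (a sketch; the C++ standard leaves it implementation-defined whether the stack is unwound on an uncaught exception, but the common two-phase unwinders don't unwind):

```cpp
#include <cstdio>
#include <stdexcept>

struct Noisy {
    ~Noisy() { std::puts("destructor ran"); }
};

int main() {
    Noisy n;
    throw std::runtime_error("no handler anywhere");
    // Phase 1 searches the stack for a handler, finds none, and
    // std::terminate() fires with the crashing stack still intact:
    // "destructor ran" is never printed. Wrap the throw in a
    // try/catch and it is.
}
```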
Another big bottleneck that might not be captured in the OP's test is that the unwinder has to take (or used to; things have gotten better recently) a global lock to prevent races with dlclose, which greatly limits the scalability of exception handling.
Still, very nice improvements from the OP.
[1] although I'm not sure whether you can mix them in the same program or whether it's a platform-wide decision.
Also, I don't think it's possible to perform any kind of backwards-compatible static analysis that would tell you when it's safe to just JMP. Unless you have full information, perhaps (at the very least, everything already specialized).
> Returning to an alternate address shouldn’t be significantly more expensive than returning to the default address, so this has to be cheap.
Modern CPUs add complications to arguments like this. Branches stall the execution pipeline, so branch prediction was invented to keep the pipeline flowing. Return instructions are perfectly predicted, which makes them effectively free. At the very least, any alternate return scheme has to pay for a full misprediction. That can be expensive.
>Return instructions are perfectly predicted
As long as you don't overwrite the return address on the stack.
This doesn't really make sense. The branch predictor relies on a history of previous executions of that branch, or on explicit hints, to decide if a branch will be taken or not. Based on this prediction, the speculative execution hardware then sees the jump (return/panic) and loads the code from that address into the icache. There is 0 difference between `if (condition) jump $panic_recover_address` and `if (condition) jump $function_return_address` in terms of how easy or hard it is to predict or speculatively load based on the prediction.
On x86, ret and call are explicit instructions. Ret always predicts the address of the last call, which is (usually) 100% accurate. Your example of `if (condition) jump $panic_recover_address` contains two branches, either of which can be mispredicted.
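To make the two explicit (a hypothetical sketch; `failed` and `recover` are made-up stand-ins for the panic condition and the handler address):

```cpp
extern bool failed;          // hypothetical: the error condition
extern void (*recover)();    // hypothetical: the handler's address

void callee() {
    if (failed)              // branch 1: conditional, predicted from history
        recover();           // branch 2: indirect, the *target* must be predicted too
    // normal path falls through to `ret`, predicted via the return stack buffer
}
```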
Since Intel processors have a shadow (call) stack to ensure control-flow integrity, I imagine they use it to predict the return address as well.
Intel (and most CPUs really) have had a "shadow" call stack for ret+call prediction (the Return Stack Buffer) decades before they had the control-flow integrity shadow stack. It is possible that the two stacks are now unified, but I'm not 100% sure.
The RSB has 10 or so entries and is on the critical path, while the integrity stack might be larger and have less strict performance requirements, so they might be separate structures.
No, an unconditional jump to a fixed address can't be mispredicted.
How can a reusable function have a fixed return address?
This is a good point that I hadn't considered - these are indirect jumps, and the return instruction gets special handling from the processor to predict its target address, which a jump to a recover address can't have.
Because of the stack? It is not fixed, but once a function is called, the CPU knows where it was called from.
Returns, conditional jumps and indirect jumps each have fairly different prediction profiles. In particular, paired call+ret sequences are predicted perfectly, at least until the dedicated internal stack overflows; indirect jumps are, as a general rule, less predictable than conditional jumps, as more state (the jump target) needs to be stored.