Bringing Faster Exceptions to Rust

(purplesyringa.moe)

53 points | by stpn 9 hours ago ago

15 comments

  • weinzierl 3 hours ago

    Great analysis of unwinding overhead in Rust. The framing of exceptions as "alternate returns" is enlightening - they should be cheap in theory, which makes the current ~2.3μs overhead particularly interesting to dissect. The optimization journey from removing unnecessary type erasure to using thread-locals is well explained. While the 4.3x speedup is impressive, I think the bigger takeaway isn't about replacing Result with panics, but rather identifying where non-local control flow genuinely makes sense. Looking forward to the deep dive into unwinder implementations.

  • Joker_vD 2 hours ago

    Well, unwinding can be as simple (and as fast) as

        MOV  SP, [installed_handler]
        JMP  [installed_handler+WORD]
    
    but it only works if you don't need to run the defers/Drop's/destructors/etc. for stuff that's on the stack between the current frame and the handler's frame. Which you do, most of the time.
    • gpderetta an hour ago

      > it only works if you don't need to run the defers/Drop's/destructors/etc

      Indeed. And the per-frame cleanup is language-agnostic, which adds overhead; it also must support both sjlj and dwarf frames[1]; and it is done in two phases: destructors are only run if an actual catch is found, so an unhandled exception doesn't run destructors and the state is preserved in the core file. That two-phase unwinding again slows things down.

      Another big bottleneck that might not be captured in OP's test is that the unwinder has to take (or used to, things got better recently) a global lock to prevent races with dlclose, which greatly limits the scalability of exception handling.

      Still very nice improvements from OP.

      [1] although I'm not sure you can mix them in the same program or it is a platform-wide decision.

    • Yoric an hour ago

      Also, I don't think it's possible to perform any kind of backwards-compatible static analysis that would tell you when it's safe to just JMP. Unless you have full information, perhaps (at the very least, everything already specialized).

  • codeflo 3 hours ago

    > Returning to an alternate address shouldn’t be significantly more expensive than returning to the default address, so this has to be cheap.

    Modern CPUs add complications to arguments like this. Branches stall the execution pipeline, so branch prediction was invented to keep the pipeline flowing. Return instructions are perfectly predicted, which makes them literally free. At the very least, any alternate return scheme has to pay for a full misprediction. That can be expensive.

    • meindnoch 26 minutes ago

      >Return instructions are perfectly predicted

      As long as you don't overwrite the return address on the stack.

    • tsimionescu 3 hours ago

      This doesn't really make sense. The branch predictor relies on a history of previous executions of that branch, or on explicit hints, to decide if a branch will be taken or not. Based on this prediction, the speculative execution hardware then sees the jump (return/panic) and loads the code from that address into the icache. There is 0 difference between `if (condition) jump $panic_recover_address` and `if (condition) jump $function_return_address` in terms of how easy or hard it is to predict or speculatively load based on the prediction.

      • codeflo 2 hours ago

        On x86, ret and call are explicit instructions. Ret always predicts the address of the last call, which is (usually) 100% accurate. Your example of `if (condition) jump $panic_recover_address` contains two branches, either of which can be mispredicted.

        • Joker_vD 2 hours ago

          Since Intel processors have a shadow (call-)stack to ensure control-flow integrity, I imagine they use it to predict the return address as well.

          • gpderetta an hour ago

            Intel (and most CPUs really) have had a "shadow" call stack for ret+call prediction (the Return Stack Buffer) decades before they had the control-flow integrity shadow stack. It is possible that the two stacks are now unified, but I'm not 100% sure.

            The RSB has 10 or so entries and it is in the critical path, while the integrity stack might be larger and have less strict performance characteristics, so they might be separate objects.

        • tsimionescu 2 hours ago

          No, an unconditional jump to a fixed address can't be mispredicted.

          • codeflo 2 hours ago

            How can a reusable function have a fixed return address?

            • tsimionescu an hour ago

              This is a good point that I hadn't considered - these are indirect jumps, and the return instruction has special handling from the processor to compute the address specifically, which a jump to a recover address can't have.

            • eptcyka 2 hours ago

              Because of the stack? It's not fixed, but once a function is called, the CPU must know where it was called from.

      • gpderetta an hour ago

        Returns, conditional jumps, and indirect jumps each have fairly different prediction profiles. In particular, paired call+ret are predicted perfectly, at least until the dedicated internal stack overflows; indirect jumps are, as a general rule, less predictable than conditional jumps, as more state (the jump target) needs to be stored.