Great analysis of unwinding overhead in Rust. The framing of exceptions as "alternate returns" is enlightening - they should be cheap in theory, which makes the current ~2.3μs overhead particularly interesting to dissect. The optimization journey from removing unnecessary type erasure to using thread-locals is well explained. While the 4.3x speedup is impressive, I think the bigger takeaway isn't about replacing Result with panics, but rather identifying where non-local control flow genuinely makes sense. Looking forward to the deep dive into unwinder implementations.
Well, unwinding can be as simple (and as fast) as restoring the saved stack pointer and jumping straight to the handler.
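For instance (a minimal sketch of the idea, assuming a setjmp/longjmp-style handler):

```cpp
#include <csetjmp>
#include <cstdio>

static std::jmp_buf handler;

[[noreturn]] void fail() {
    // The whole "unwind": restore the registers/stack pointer saved by
    // setjmp and jump straight back. Nothing on the skipped frames gets
    // cleaned up along the way.
    std::longjmp(handler, 1);
}

int main() {
    if (setjmp(handler)) {   // longjmp lands here, like a catch
        std::puts("recovered");
        return 0;
    }
    fail();
}
```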
but it only works if you don't need to run the defers/Drop's/destructors/etc. for stuff that's on the stack between the current frame and the handler's frame. Which you do, most of the time.
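A minimal C++ sketch of what that cleanup obligation looks like (hypothetical names, not from the article):

```cpp
#include <cstdio>
#include <stdexcept>

struct Guard {
    ~Guard() { std::puts("cleanup ran"); }  // must execute during unwinding
};

void inner() { throw std::runtime_error("boom"); }

void middle() {
    Guard g;   // lives on a frame between the throw and the handler
    inner();
}

int main() {
    try {
        middle();
    } catch (const std::exception&) {
        std::puts("handled");  // prints only after "cleanup ran"
    }
}
```

A plain longjmp from inner() past middle() would skip ~Guard entirely; in C++ that's undefined behavior precisely because the bypassed frame has a non-trivial destructor.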
> it only works if you don't need to run the defers/Drop's/destructors/etc

Indeed. And the per-frame cleanup is also language-agnostic, which adds overhead; it must also support both sjlj and dwarf frames[1]. Unwinding is also done in two phases: destructors are only run if an actual catch is found, so an unhandled exception doesn't run destructors, preserving state in the core file. That two-phase design again slows things down.
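That behavior is easy to observe (a sketch; the C++ standard leaves it implementation-defined whether the stack is unwound on an uncaught exception, but the common two-phase unwinders don't unwind):

```cpp
#include <cstdio>
#include <stdexcept>

struct Noisy {
    ~Noisy() { std::puts("destructor ran"); }
};

int main() {
    Noisy n;
    throw std::runtime_error("no handler anywhere");
    // Phase 1 searches the stack for a handler, finds none, and
    // std::terminate() fires with the crashing stack still intact:
    // "destructor ran" is never printed. Wrap the throw in a
    // try/catch and it is.
}
```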
Another big bottleneck that might not be captured in the OP's test is that the unwinder has to take (or used to; things have gotten better recently) a global lock to prevent races with dlclose, which greatly limits the scalability of exception handling.
Still, very nice improvements from the OP.
[1] although I'm not sure whether you can mix them in the same program or whether it's a platform-wide decision.
Also, I don't think it's possible to perform any kind of backwards-compatible static analysis that would tell you when it's safe to just JMP. Unless you have full information, perhaps (at the very least, everything already specialized).
> Returning to an alternate address shouldn’t be significantly more expensive than returning to the default address, so this has to be cheap.
Modern CPUs add complications to arguments like this. Branches stall the execution pipeline, so branch prediction was invented to keep the pipeline flowing. Return instructions are perfectly predicted, which makes them effectively free. At the very least, any alternate return scheme has to pay for a full misprediction. That can be expensive.
>Return instructions are perfectly predicted
As long as you don't overwrite the return address on the stack.
This doesn't really make sense. The branch predictor relies on a history of previous executions of that branch, or on explicit hints, to decide if a branch will be taken or not. Based on this prediction, the speculative execution hardware then sees the jump (return/panic) and loads the code from that address into the icache. There is 0 difference between `if (condition) jump $panic_recover_address` and `if (condition) jump $function_return_address` in terms of how easy or hard it is to predict or speculatively load based on the prediction.
On x86, ret and call are explicit instructions. Ret always predicts the address of the last call, which is (usually) 100% accurate. Your example of `if (condition) jump $panic_recover_address` contains two branches, either of which can be mispredicted.
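To make the two explicit (a hypothetical sketch; `failed` and `recover` are made-up stand-ins for the panic condition and the handler address):

```cpp
extern bool failed;          // hypothetical: the error condition
extern void (*recover)();    // hypothetical: the handler's address

void callee() {
    if (failed)              // branch 1: conditional, predicted from history
        recover();           // branch 2: indirect, the *target* must be predicted too
    // normal path falls through to `ret`, predicted via the return stack buffer
}
```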
Since Intel processors have a shadow (call) stack to ensure control-flow integrity, I imagine they use it to predict the return address as well.
Intel (and most CPUs really) have had a "shadow" call stack for ret+call prediction (the Return Stack Buffer) decades before they had the control-flow integrity shadow stack. It is possible that the two stacks are now unified, but I'm not 100% sure.
The RSB has 10 or so entries and is on the critical path, while the integrity stack might be larger and have less strict performance requirements, so they might be separate structures.
No, an unconditional jump to a fixed address can't be mispredicted.
How can a reusable function have a fixed return address?
This is a good point that I hadn't considered - these are indirect jumps, and the return instruction gets special handling from the processor to predict its target address, which a jump to a recover address can't have.
Because of the stack? It is not fixed, but once a function is called, the CPU knows where it was called from.
Returns, conditional jumps and indirect jumps each have fairly different prediction profiles. In particular, paired call+ret sequences are predicted perfectly, at least until the dedicated internal stack overflows; indirect jumps are, as a general rule, less predictable than conditional jumps, as more state (the jump target) needs to be stored.