My first instinct for a poorly predicted branch would be to use a conditional move.
This isn't always a win: a conditional move keeps the CPU from speculating down the wrong path, but it also keeps it from speculating down the correct one.
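As a sketch (function and parameter names are hypothetical), the branchless form is just a select that the compiler can typically lower to a cmov at -O1 and above on x86-64:

```c
/* Branchless select: compilers typically lower this ternary to a cmov
   on x86-64, so there is no branch to mispredict, but also no path
   for the CPU to speculate ahead on while the condition is unknown. */
int select_fee(int is_vip, int vip_fee, int normal_fee) {
    return is_vip ? vip_fee : normal_fee;
}
```

Whether you actually get a cmov depends on the compiler and optimization level, so it is worth checking the generated assembly.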
If you really don't care about the failure path and really don't mind unmaintainable low-level hacks, I can think of a few ways to get creative.
First, there's the whole array of anti-speculation-exploit tricks in the Linux kernel that you can use as inspiration for controlling what the CPU is allowed to speculate. Those little bits of assembly were reviewed by engineers from Intel and AMD, so they can't quietly stop working without also breaking the kernel.
Another idea is to take inspiration from anti-reverse engineering tricks. Make the failure path an actual exception. I don't mean software stack unwinding, I mean divide by your boolean and then call your send function unconditionally. If the boolean is true, it costs nothing because the result of the division is unused and we just speculate past it. If the boolean is false, the CPU will raise a divide by 0 exception, and this invisible branch will never be predicted by the CPU. Then your exception handler recovers and calls the cold path.
> I asked Claude if there is such a way to basically hard-code branch prediction rules into the machine code, and the answer was that there’s no way to do this on x86, but there is a way on ARM: the BEQP (predict branch taken) and BEQNP (predict branch not taken) instructions.
> Those ARM instructions are just hallucinated, and the reality is actually the other way around: ARM doesn’t have a way of hard-coding ‘predictions’, but x86 does.
This made me chuckle. Thanks.
If a human wrote that here (on HN) someone would note the error and the poster would reply:
Yes, sorry, you’re correct. I’ve usually had 97 more double ristrettos by this time in the morning.
Some schools of thought suggest this has already happened.
LLMs are just what happens when we hit the singularity of caffeine consumption?
What a fun problem to think about.
My first instinct, knowing less about this domain than maybe I should, would be to abuse the return address predictor. I believe CPUs will generally predict the target of a “ret” instruction using an internal stack of return addresses; some ARM flavours even make this explicit (https://developer.arm.com/documentation/den0042/0100/Unified...).
The way to abuse this would be to put send() on the normal return path and call abandon() by rewriting the return address. In code:
This isn’t exactly correct because it ignores control flow integrity (which you’d have to bypass), doesn’t work like this on every architecture, and abandon() would need to be written partly in assembly to deal with the fact that the stack is in a weird state post-return, but hopefully it conveys the idea anyway.

The if in predict() is implementable as a branchless conditional move. The return address predictor should guess that predict() will return to send(), but in most cases you’ll smash the return address to point at abandon() instead.
Why not just turn all the abandoned transactions into fake, to-be-discarded transactions and discard them at the send stage later? E.g. by poisoning the frame checksum or setting something invalid on them, so they get discarded.
Seems you'd be doing this anyway with the dummy transactions.
Then you have no branch, though may want to add dummy transactions anyway to keep the code in cache.
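A sketch of that idea (the frame layout, field names, and poison constant are all hypothetical; the real checksum would be whatever your wire format uses):

```c
#include <stdint.h>

/* Hypothetical wire frame with a trailing checksum. */
struct frame {
    uint8_t  payload[64];
    uint32_t crc;
};

/* Branchlessly finalize: a kept frame gets the real CRC, an abandoned
   one gets a guaranteed-bad CRC, so the receiver discards it and the
   send path itself never branches on "keep". */
void finalize(struct frame *f, int keep, uint32_t computed_crc) {
    uint32_t mask = (uint32_t)-(uint32_t)(!keep); /* 0 if keep, all-ones if not */
    f->crc = computed_crc ^ (mask & 0xA5A5A5A5u);
}
```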
This is literally what it says in TFA lol
Make sure your global branch history is the same when "mistraining" and predicting with your BTB. You may end up in the wrong BTB entry and still mess up your prediction :).
> On modern x86 processors, those instruction prefixes are simply ignored
This sucks.
The branch taken hint (3EH) was re-added in Redwood Cove (2023), but it's only for static prediction where the branch predictor has not yet encountered the branch - ie, useful for things you would only use once or twice but would likely take the branch. Once the branch predictor has some information the static prediction hint is ignored, so it's best to omit it for anything that will eventually have dynamic branch prediction (ie, run many times), because you'll be consuming bytes in the i-cache which serve no purpose.
Interesting problem! Not a very satisfying solution but I can't think of anything better. Even if there were hints that were respected, you'd still have the problem of the code not being in icache, unless you actually execute it occasionally.
Do CPUs really track that much about branches? I know a JIT does, but where does a CPU find the memory needed to store those counters? And then, how does reading them not cause a miss of its own, given that the CPU can't speculate until it has made the prediction for the if?
The last time I checked a CPU's documentation, it had a simple rule that branches are always predicted taken, which would be easy for a compiler to exploit by ordering the likely code first. I don't recall which CPU that was, though. Still, this whole thing feels like it needs a citation, and I'm suspicious it's false. CPU designers know this matters, and compilers sometimes have information the CPU doesn't that users care about, so they document how it works. (This is about the CPU, not the instruction set.)
Maybe I'm misinterpreting you, but I think you are vastly underestimating both the accuracy and the benefit of branch prediction. Yes, CPUs track that much about branches. Your suspicion that this is false is misguided. Here's Dan Luu's 2017 overview: https://danluu.com/branch-prediction/.