Just so you know, the "return x < 0 ? -x : x;" compiles into
abs_branch:
mov eax, edi
neg eax
cmovs eax, edi
ret
on x64, and into
abs_branch:
srai a5,a0,31
xor a0,a5,a0
sub a0,a0,a5
ret
on RISC-V, if you use a C compiler with half-decent codegen. And the "branchy" clamp() translates into
clamp:
cmp edi, edx
mov eax, esi
cmovle edx, edi
cmp edi, esi
cmovge eax, edx
ret
Seriously, the automatic transformation between ?: and if-then-else (in both directions) is quite well studied by now. And if you try to benchmark the difference between branching and branchless implementations, please make sure that the branches you expect are actually there in the compiler's output.
I need something like this for a switch() statement (technically, a list of arbitrary functions). Sort of like up to N branches in one step.
The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:
// branch version
switch (var1)
{
case 1:
    var4 = var2 + var3;
    break;
case 2:
    var4 = var2 - var3;
    break;
case 3:
    var4 = var2 * var3;
    break;
case 4:
    var4 = var2 / var3;
    break;
// ...
default:
    var4 = 0; // optional and arbitrary
    break;
}
// branchless version
var4 = magic(var1, var2, var3);
I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions onto the larger set of functions.
So for now, I may just use a switch() statement in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to handle that performantly, or if they just end up calculating everything and throwing away the paths not taken, which would suggest that the max speed would be O(1/N) in the number of cases.
Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.
Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.
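For what it's worth, a minimal sketch of that compute-everything-then-select idea in plain C, using comparison-derived masks instead of branches (the names follow the example above; overflow corner cases and the guarded division are glossed over, so treat it as an illustration, not the answer):

#include <stdint.h>
#include <stdio.h>

/* Branchless "magic": evaluate every candidate, then keep the one selected
   by var1. A comparison like (var1 == 1) yields 0 or 1 (usually via setcc,
   not a jump); negating it gives an all-zeros/all-ones mask. */
static int32_t magic(int32_t var1, int32_t var2, int32_t var3)
{
    int32_t safe3 = var3 | (var3 == 0);   /* avoid dividing by zero; that result is masked out anyway */

    int32_t r1 = var2 + var3;
    int32_t r2 = var2 - var3;
    int32_t r3 = var2 * var3;
    int32_t r4 = var2 / safe3;

    int32_t m1 = -(var1 == 1);
    int32_t m2 = -(var1 == 2);
    int32_t m3 = -(var1 == 3);
    int32_t m4 = -(var1 == 4);

    /* The default case (var4 = 0) falls out automatically: every mask is zero. */
    return (r1 & m1) | (r2 & m2) | (r3 & m3) | (r4 & m4);
}

int main(void)
{
    printf("%d %d %d %d %d\n",
           magic(1, 6, 3), magic(2, 6, 3), magic(3, 6, 3), magic(4, 6, 3), magic(7, 6, 3));
    /* prints: 9 3 18 2 0 */
    return 0;
}

Every path gets computed on every call, so this only wins if the work per case is cheap and the selector is unpredictable; as noted above, check the compiler's output before trusting either version.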
I'm by no means an expert on this topic, but what about using a predefined hash map from value to function call/closure, unless that also yields branchy code underneath? Additionally, maybe you want to use macros of some kind to generate the "magic" function you're looking for, where it constructs a single branchless statement of up to N terms, whether that's procedural macros in Rust or C macros.
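A rough sketch of that table-of-functions approach in C (all names made up): the if/else chain disappears, though the indirect call is still a jump the CPU has to predict.

#include <stdio.h>

typedef int (*op_fn)(int, int);

/* One small function per case; slot 0 doubles as the default. */
static int op_zero(int a, int b) { (void)a; (void)b; return 0; }
static int op_add (int a, int b) { return a + b; }
static int op_sub (int a, int b) { return a - b; }
static int op_mul (int a, int b) { return a * b; }
static int op_div (int a, int b) { return b != 0 ? a / b : 0; }

static const op_fn ops[] = { op_zero, op_add, op_sub, op_mul, op_div };

static int magic_table(unsigned var1, int var2, int var3)
{
    unsigned idx = var1 < 5u ? var1 : 0u;   /* out-of-range selectors hit the default; typically a cmov */
    return ops[idx](var2, var3);
}

int main(void)
{
    printf("%d %d\n", magic_table(1, 6, 3), magic_table(9, 6, 3));   /* prints: 9 0 */
    return 0;
}

This is roughly what a compiler builds for a dense switch anyway (a jump table), so it tends to be a wash unless the table lets you share setup work across cases.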
I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, yet the summary claims that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", I'm not confident that small changes to the benchmarking setup wouldn't have substantially changed that conclusion.
Yeah. When your timing results are a single digit multiple of your timing precision, that is a good indication you either need a longer test, or a more precise clock.
At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with the clock ticks.
For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:
Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).
I had to drop down to -O0 to see branchless be faster in any case:
In general, modern compilers will often unroll or inline functions without people even noticing. This often helps with cache locality and instruction-level parallelism.
Most code should focus on readability, then profile for busy areas under real use, and finally refactor those busy areas through hand optimization or register hints as required.
If one creates something that looks suspect (an inline assembly macro), a peer or an LLVM build will come along and ruin it later for sure. Have a great day =3
Doesn’t it also help with branch prediction since the unrolled loop can use different statistics with each copy?
This great video [0] demonstrates how CPU performance has only increased 1.5-2x over the past 15(!) years when executing extremely branchy code. Really shows you just how deep modern CPU pipelines have become.
The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article. I assume the minimal/nonexistent speedups in the article are due to modern CPU branch prediction being quite good. But branch predictors inherently fail miserably at CRC because the conditional is on whether the input bit is 1 or 0, which is essentially random.
[0] https://www.youtube.com/watch?v=m7PVZixO35c
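To make the CRC comparison concrete, here is a minimal sketch of the two bit-at-a-time styles (standard reflected CRC-32 polynomial 0xEDB88320; the constants and framing are assumptions, not taken from the video):

#include <stdint.h>
#include <stddef.h>

/* Branchy bit-at-a-time CRC-32. The loop branches are perfectly predictable;
   the if (crc & 1) test is essentially a coin flip on typical data. */
uint32_t crc32_branchy(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return ~crc;
}

/* Same computation, but the low bit becomes an all-zeros/all-ones mask,
   so there is nothing left for the branch predictor to guess. */
uint32_t crc32_branchless(const uint8_t *p, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++) {
            uint32_t mask = (uint32_t)0 - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return ~crc;
}

(Table-driven or carry-less-multiply CRCs are faster still; the bit-at-a-time pair just isolates the branch-prediction effect.)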
Branchless is also useful for cryptographic transforms as it frustrates timing attacks. And that’s a situation where it only needs to be relatively fast compared to the branching alternative because we are trying to improve security while limiting the overhead of doing so.
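The usual building block there is a constant-time select, something like this sketch, so that which value "wins" never becomes a taken/not-taken branch on secret data:

#include <stdint.h>

/* Returns a if bit is 1, b if bit is 0. The mask is all-ones or all-zeros,
   and both inputs are always read, so timing does not depend on the secret. */
uint32_t ct_select_u32(uint32_t bit, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (bit & 1u);
    return (a & mask) | (b & ~mask);
}

Whether the compiler keeps this straight-line is its own problem; careful crypto code inspects the emitted assembly, which echoes the advice at the top of the thread.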
The linked video doesn't take into account power consumption of these CPUs. He seems to be comparing laptop CPUs, NUC CPUs and desktop CPUs. If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance.
Even if you exclude the three mobile CPUs in the charts (the 2012 i5, the 2015 i7, and the 2023 i9 NUC), the results still hold.
>If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance
Sure, but this is over a lot more than a couple years. I'd expect a 2023 mobile i9 to be considerably more than twice as fast as a 2007 desktop Core 2 Duo.
I've always wondered if any CPUs have tried to reduce the branch penalty by speculatively executing both ways at once in parallel. You'd have two of everything (two pipelines, two ALUs, two sets of registers, etc.) and when you hit a conditional branch, instead of guessing which way to go, you'd essentially fork.
Obviously that requires a lot of extra transistors and you are doing computation that will be thrown away, so it's not free in terms of space or power/heat/energy. But perhaps it could handle cases that other approaches can't.
Even more of a wild idea is to pair up two cores and have them work together this way. When you have a core that would have been idle anyway, it can shadow an active core and be its doppelganger that takes the other branch. You'd need to have very fast communication between cores so the shadow core can spring into action instantly when you hit a branch.
My gut instinct is it's not worth it overall, but I'm curious whether it's been tried in the real world.
Yes, it's been looked at. If you wanna skim the research use "Eager Execution" and "Disjoint Eager Execution" as jumping off points.
It doesn't require duplicating everything. You just need to add some additional bookkeeping of dependencies and what to retire vs kill at the end of the pipeline.
In practice, branch predictors are so good that speculating off the "spine" of the most likely path just isn't worth it.
In fact there were a lot of interesting microarchitectural ideas from the late 90s to early 00s that just ended up being moot because the combination of OoO speculation, branch predictors, and big caches proved so effective.
It's what happens and it gave us a really big issue a few years ago https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...
What has worked out very well in practice is hyper-threading. So you take instructions from two threads and if one of them is waiting on a branch the units of the CPU don't go unused.
Yes, this has been done for a while now, speculative execution + register renaming is how this happens. https://en.wikipedia.org/wiki/Register_renaming
yes, it has been done for at least a decade if not more
> Even more of a wild idea is to pair up two cores and have them work together this way
I don't think that'll be profitable, because...
> When you have a core that would have been idle anyway
...you'll just schedule in another process. A modern OS rarely runs short of tasks to run.
This same blog posted "How I accidentally created the fastest CSV parser ever made" yesterday, and it got flagged:
https://news.ycombinator.com/item?id=45400009
Article doesn’t mention this, but I’d consider neural networks a form of branchless programming. It’s all a bunch of multiply and threshold operations.
> It’s all a bunch of multiply and threshold operations.
Real-world high-performance matrix multiplication functions do contain branches internally, even on GPUs. If you are ever curious about what that looks like, NVidia maintains an open-source library called CUTLASS.
Thresholding requires branching, no?
No, you can do that without branching. See https://en.algorithmica.org/hpc/pipelining/branchless/#predi... for example.
As demonstrated in the article, you can compute clamp(x, min, max) with straight-line code.
No, it’s just a math operation and can be applied to any amount of data in parallel.
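For a concrete picture of what the replies mean, a threshold can be a pure data operation (a sketch; the exact form inside a real network kernel will differ):

#include <math.h>

/* 1.0f if x > t, else 0.0f: the comparison yields an integer 0 or 1, no jump required. */
float step_threshold(float x, float t)
{
    return (float)(x > t);
}

/* ReLU-style clamp at zero: typically compiles to a single maxss-type instruction. */
float relu(float x)
{
    return fmaxf(x, 0.0f);
}

Applied across a whole tensor this is just an element-wise max, which is why inference kernels stay branch-free in their inner loops.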
Great article, triggers some memories.
When you get to think about branchless programming, especially for real-world SIMD optimizations, you always learn a lot, and it's as if you gain a +1 level in your algorithmic skills. The hardest part is then making sure the tricks are clearly laid out so that someone else can take it from there next time.
I think I'm getting too sensitive to AI-assisted articles. So many whimsical terms to make something "fun".
The classic 32-bit ARM instruction set made nearly every instruction conditional to avoid the need for short branches, but AArch64 moved away from that since branch prediction is so good nowadays.
I’m amused to see a for loop in a function that is purportedly branchless
(not a critique)
The article is easy to follow, but I think the author missed the point: branchless programming (a subset of the better-known constant-time programming) is almost exclusively used in cryptography nowadays. As shown by the benchmarks in the article, modern branch predictors have easily achieved over 95%, if not 99%, accuracy for a decade or so.
I used to do stuff like this (ok, not half as smart), but stopped around 2013 or so, as the distinction between "implementation defined" behavior (ok) and "undefined" behavior (not ok) started to matter and bite.
After thinking through this carefully, though, I do not see UB (except for signed overflow in a corner case):
Step 1, bit shift.
I understand that, until C++20, left-shifting a negative signed int was UB. But this right shift appears to be implementation-defined, which is ok if documented in the code.
Step 2: Add.
Then, (x + mask) is defined behavior (a) for positive x, since then mask=0, and (b) for most negative x, since then mask=-1. However, if x is numeric_limits::lowest, then you get signed integer overflow, which is UB.
Step 3, XOR.
Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as of C++20, when signed integers became two's complement officially. It might have been implementation defined before then, which would be almost as good for something that ubiquitous, but I'm not sure.
Ok, so I think this does not involve UB except for an input of numeric_limits::lowest().
Sound about right?
To fix this, perhaps you would need to make that + an unsigned one?
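If it helps, a sketch of that unsigned-arithmetic variant (assuming 32-bit int and the usual arithmetic right shift; INT_MIN still maps to itself, since its magnitude is not representable):

#include <stdint.h>

/* Mask-based abs with the add and xor done on uint32_t, so wraparound is
   well-defined instead of signed-overflow UB. */
int32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = (uint32_t)(x >> 31);   /* 0 for x >= 0, 0xFFFFFFFF for x < 0 */
    return (int32_t)((ux + mask) ^ mask);  /* conversion back is implementation-defined pre-C++20, two's complement in practice */
}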
It bothers me how hard you need to think to do this language lawyering. C++ is a system of rules. Computers interpret rules. The language lawyer should be a piece of software. I get it, not everything can be done statically, so, fine, do it dynamically? UBSan comes closest in practice, but doesn't detect everything. I understand formally modeled versions of C and C++ have been developed commercially, but these are not open source so they effectively do not exist. It's a strange situation.
Just the other day I considered writing something branchless and said, "nah", because of uncertainty around UB. How much performance is being left on the table around the world because of similar thinking, driven by the existence of UB?
Maybe I was supposed to write OCaml or Pascal or Rust.
In the part about "abs", there's an assembly breakdown:
mov eax, edi
sar eax, 31
mov ecx, eax
add edi, ecx
xor eax, edi
Has this been generated by a C compiler? If yes, it's a bit puzzling, because can't you remove "mov ecx, eax", replace "add edi, ecx" by "add edi, eax" and have the exact same result?
If you look at compiler output, you will always see plenty of small stupidities.
So why does conventional wisdom say that compilers will, in the vast majority of the time, outperform programmers doing assembly by hand? It seems contradictory to me.