The Weird Concept of Branchless Programming

(sanixdk.xyz)

79 points | by judicious 4 hours ago

33 comments

  • Joker_vD 20 minutes ago

    Just so you know, the "return x < 0 ? -x : x;" compiles into

        abs_branch:
            mov     eax, edi
            neg     eax
            cmovs   eax, edi
            ret
    
    on x64, and into

        abs_branch:
            srai    a5,a0,31
            xor     a0,a5,a0
            sub     a0,a0,a5
            ret
    
    on RISC-V if you use a C compiler with a half-decent codegen. And "branchy" clamp() translates into

        clamp:
            cmp     edi, edx
            mov     eax, esi
            cmovle  edx, edi
            cmp     edi, esi
            cmovge  eax, edx
            ret
    
    Seriously, the automatic transformation between ?: and if-then-else (in both directions) is quite well studied by now. And if you try to benchmark the difference between branching and branchless implementations, please make sure that the branches you expect are actually there in the compiler's output.
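
    For reference, the branchy clamp I'm assuming here is the obvious one, with the argument order the asm suggests:

        int clamp(int x, int lo, int hi)
        {
            return x < lo ? lo : (x > hi ? hi : x);
        }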
  • zackmorris 27 minutes ago

    I need something like this for a switch() command (technically a list of arbitrary functions). Sort of like up to N branches in one step.

    The idea is to take some number of inputs A, B, C, ... and conceptually perform all of the possible functions simultaneously, then keep the one that's desired and throw the rest away. For any arbitrary logic. Ideally using fewer operations than all of that, but that's optional. Driven by one variable, it would look like:

      // branch version
      switch(var1)
      {
        case 1:
          var4 = var2 + var3;
          break;
        case 2:
          var4 = var2 - var3;
          break;
        case 3:
          var4 = var2 * var3;
          break;
        case 4:
          var4 = var2 / var3;
          break;
        // ...
        default:
          var4 = 0; // optional and arbitrary
          break;
      }
      
      // branchless version
      var4 = magic(var1, var2, var3);
    
    I don't know how to do this outside of programmable hardware like an FPGA. The problem is that it's extremely challenging to write/solve the formulas that map ordinary arithmetic and bitwise functions to the larger set of functions.

    So for now, I may just use a switch() command in a shader and let it figure out any reusable logic internally. I don't know if shaders have come far enough to allow that performantly, or if they end up just calculating everything and throwing away the paths not taken. Which would suggest that the max speed would be O(1/N) for the number of cases.

    Does anyone know? Maybe truth tables? Or a SAT solver? I'm also not sure how this would work for floating point, but that's optional too.

    Edit: I updated the cases to show how var1 controls the math performed on var2, var3 and var4.
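
    To be concrete: for the four-case toy above, a mask-and-select version works, though I don't see how it generalizes to arbitrary logic (a sketch; the divide is guarded so every case is safe to evaluate, and the signed add/multiply/divide can still overflow):

      #include <stdint.h>

      int32_t magic(int32_t var1, int32_t var2, int32_t var3)
      {
          int32_t m1 = -(int32_t)(var1 == 1);  // all ones when selected, else 0
          int32_t m2 = -(int32_t)(var1 == 2);
          int32_t m3 = -(int32_t)(var1 == 3);
          int32_t m4 = -(int32_t)(var1 == 4);
          int32_t d  = var3 | (var3 == 0);     // divide by 1 instead of 0
          return (m1 & (var2 + var3))
               | (m2 & (var2 - var3))
               | (m3 & (var2 * var3))
               | (m4 & (var2 / d));            // any other var1 gives 0
      }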

    • judicious 12 minutes ago

      I’m by no means an expert on this topic, but what about using a predefined hash map from value to function call/closure, unless that also yields branching underneath? Additionally, maybe you want to use macros of some kind to generate the “magic” function you’re looking for, where it constructs a single branchless statement of up to N terms, whether it’s procedural macros in Rust or C macros.
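
      As a rough sketch of the table idea (names are made up; note the indirect call is itself still a predicted branch at the hardware level):

        typedef int (*op_fn)(int, int);

        static int op_add(int a, int b) { return a + b; }
        static int op_sub(int a, int b) { return a - b; }
        static int op_mul(int a, int b) { return a * b; }
        static int op_div(int a, int b) { return b != 0 ? a / b : 0; }

        static const op_fn ops[] = { op_add, op_sub, op_mul, op_div };

        // var4 = ops[var1 - 1](var2, var3);  // assuming 1 <= var1 <= 4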

  • sfilmeyer 3 hours ago

    I enjoyed reading the article, but I'm pretty thrown by the benchmarks and conclusion. All of the times are reported to a single digit of precision, but then the summary claims that one function shows an improvement while the other two are described as negligible. When all the numbers presented are "~5ms" or "~6ms", I'm not confident that small changes to the benchmarking wouldn't have substantially changed that conclusion.

    • gizmo686 35 minutes ago

      Yeah. When your timing results are a single-digit multiple of your timing precision, that is a good indication you either need a longer test or a more precise clock.

      At a 5ms baseline with millisecond precision, the smallest improvement you can measure is 20%. And you cannot distinguish a 20% speedup from a 20% slowdown that happened to get lucky with clock ticks.
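
      Something along these lines is the shape I mean (a sketch, assuming POSIX clock_gettime rather than whatever the article used):

        #include <time.h>

        // Time many iterations with a nanosecond-resolution monotonic clock.
        static double bench_seconds(void (*fn)(void), long iters)
        {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            for (long i = 0; i < iters; i++)
                fn();
            clock_gettime(CLOCK_MONOTONIC, &t1);
            return (double)(t1.tv_sec - t0.tv_sec)
                 + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
        }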

      For what it is worth, I ran the provided test code on my machine with a 100x increase in iterations and got the following:

        == Benchmarking ABS ==
        ABS (branch):     0.260 sec
        ABS (branchless): 0.264 sec
      
        == Benchmarking CLAMP ==
        CLAMP (branch):     0.332 sec 
        CLAMP (branchless): 0.538 sec
      
        == Benchmarking PARTITION ==
        PARTITION (branch):     0.043 sec
        PARTITION (branchless): 0.091 sec
      
      Which is not exactly encouraging (gcc 13.3.0, -ffast-math -march=native; I did not use the -fomit-this-entire-function flag, which my compiler does not understand).

      I had to drop down to -O0 to see branchless be faster in any case:

        == Benchmarking ABS ==
        ABS (branch):     0.743 sec
        ABS (branchless): 0.948 sec
      
        == Benchmarking CLAMP ==
        CLAMP (branch):     4.275 sec
        CLAMP (branchless): 1.429 sec
      
        == Benchmarking PARTITION ==
        PARTITION (branch):     0.156 sec
        PARTITION (branchless): 0.164 sec
    • Joel_Mckay 2 hours ago

      In general, modern compilers will often unroll or inline functions without people even noticing. This often helps with cache level state localization and parallelism.

      Most code should focus on readability, then profile for busy areas under use, and finally refactor the busy areas through hand optimization or register hints as required.

      If one creates something that looks suspect (e.g., an inline assembly macro), a peer or an LLVM build will come along and ruin it later for sure. Have a great day =3

      • hinkley 25 minutes ago

        Doesn’t it also help with branch prediction since the unrolled loop can use different statistics with each copy?

  • MontyCarloHall 3 hours ago

    This great video [0] demonstrates how CPU performance has only increased 1.5-2x over the past 15(!) years when executing extremely branchy code. Really shows you just how deep modern CPU pipelines have become.

    The video also showcases a much more impressive branchless speedup: computing CRC checksums. Doing this naïvely with an if-statement for each bit is ~10x slower than doing it branchless with a single bitwise operation for each bit. The author of the article should consider showcasing this too, since it's a lot more impressive than the measly 1.2x speedup highlighted in the article. I assume the minimal/nonexistent speedups in the article are due to modern CPU branch prediction being quite good. But branch predictors inherently fail miserably at CRC because the conditional is on whether the input bit is 1 or 0, which is essentially random.
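
    To make the trick concrete, here is a bit-at-a-time sketch with the standard reflected CRC-32 polynomial (my own reconstruction, not the code from the video):

        #include <stddef.h>
        #include <stdint.h>

        uint32_t crc32_branchy(const uint8_t *p, size_t n)
        {
            uint32_t crc = 0xFFFFFFFFu;
            while (n--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++) {
                    if (crc & 1)  // data-dependent: essentially a coin flip
                        crc = (crc >> 1) ^ 0xEDB88320u;
                    else
                        crc >>= 1;
                }
            }
            return ~crc;
        }

        uint32_t crc32_branchless(const uint8_t *p, size_t n)
        {
            uint32_t crc = 0xFFFFFFFFu;
            while (n--) {
                crc ^= *p++;
                for (int i = 0; i < 8; i++)  // mask instead of branch
                    crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
            }
            return ~crc;
        }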

    [0] https://www.youtube.com/watch?v=m7PVZixO35c

    • hinkley 23 minutes ago

      Branchless is also useful for cryptographic transforms as it frustrates timing attacks. And that’s a situation where it only needs to be relatively fast compared to the branching alternative because we are trying to improve security while limiting the overhead of doing so.
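
      The canonical shape is something like this constant-time comparison (a sketch, not from any particular library), where the runtime depends only on the length, never on where the inputs differ:

        #include <stddef.h>
        #include <stdint.h>

        // Compare two secrets without an early-exit branch.
        int ct_memeq(const uint8_t *a, const uint8_t *b, size_t n)
        {
            uint8_t diff = 0;
            for (size_t i = 0; i < n; i++)
                diff |= a[i] ^ b[i];  // accumulate differences
            return diff == 0;
        }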

    • SynasterBeiter 2 hours ago

      The linked video doesn't take into account power consumption of these CPUs. He seems to be comparing laptop CPUs, NUC CPUs and desktop CPUs. If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance.

      • MontyCarloHall 2 hours ago

        Even if you exclude the three mobile CPUs in the charts (the 2012 i5, the 2015 i7, and the 2023 i9 NUC), the results still hold.

        >If you compare a 100W CPU and a 30W CPU that's a couple of years newer, you shouldn't be surprised there isn't much of a difference in performance

        Sure, but this is over a lot more than a couple years. I'd expect a 2023 mobile i9 to be considerably more than twice as fast as a 2007 desktop Core 2 Duo.

  • adrianmonk 2 hours ago

    I've always wondered if any CPUs have tried to reduce the branch penalty by speculatively executing both ways at once in parallel. You'd have two of everything (two pipelines, two ALUs, two sets of registers, etc.) and when you hit a conditional branch, instead of guessing which way to go, you'd essentially fork.

    Obviously that requires a lot of extra transistors and you are doing computation that will be thrown away, so it's not free in terms of space or power/heat/energy. But perhaps it could handle cases that other approaches can't.

    Even more of a wild idea is to pair up two cores and have them work together this way. When you have a core that would have been idle anyway, it can shadow an active core and be its doppelganger that takes the other branch. You'd need to have very fast communication between cores so the shadow core can spring into action instantly when you hit a branch.

    My gut instinct is it's not worth it overall, but I'm curious whether it's been tried in the real world.

    • jasonwatkinspdx an hour ago

      Yes, it's been looked at. If you wanna skim the research, use "Eager Execution" and "Disjoint Eager Execution" as jumping-off points.

      It doesn't require duplicating everything. You just need to add some additional bookkeeping of dependencies and what to retire vs kill at the end of the pipeline.

      In practice branch predictors are so good that speculating off the "spine" of the most likely path just isn't worth it.

      In fact there were a lot of interesting microarchitectural ideas from the late 90s to early 00s that just ended up being moot because the combination of OoO speculation, branch predictors, and big caches proved so effective.

    • thegreatwhale8 2 hours ago

      That's what happens, and it gave us a really big issue a few years ago: https://en.wikipedia.org/wiki/Spectre_(security_vulnerabilit...

    • hawk_ 2 hours ago

      What has worked out very well in practice is hyper-threading. You take instructions from two threads, and if one of them is waiting on a branch, the CPU's execution units don't go unused.

    • terryf 2 hours ago

      Yes, this has been done for a while now; speculative execution + register renaming is how this happens. https://en.wikipedia.org/wiki/Register_renaming

    • mshockwave an hour ago

      yes, it has been done for at least a decade if not more

      > Even more of a wild idea is to pair up two cores and have them work together this way

      I don't think that'll be profitable, because...

      > When you have a core that would have been idle anyway

      ...you'll just schedule in another process. A modern OS rarely runs short of tasks to run.

  • styfle 3 hours ago

    This same blog posted yesterday with “How I accidently created the fastest CSV parser ever made” and got flagged:

    https://news.ycombinator.com/item?id=45400009

  • Legend2440 3 hours ago

    Article doesn’t mention this, but I’d consider neural networks a form of branchless programming. It’s all a bunch of multiply and threshold operations.

    • david-gpu an hour ago

      > It’s all a bunch of multiply and threshold operations.

      Real-world high-performance matrix multiplication functions do contain branches internally, even on GPUs. If you are ever curious about what that looks like, NVidia maintains an open-source library called CUTLASS.

    • abyesilyurt 3 hours ago

      Thresholding requires branching, no?

  • rmnclmnt 3 hours ago

    Great article, triggers some memories.

    When you get to think about branchless programming, especially for SIMD optimizations in the real world, you always learn a lot, and it’s as if you get a +1 level on your algorithmic skills. The hardest part then is making sure the tricks are clearly laid out so that someone else can take it from there next time.

  • gtirloni an hour ago

    I think I'm getting too sensitive to AI-assisted articles. So many whimsical terms to make something "fun".

  • HarHarVeryFunny an hour ago

    The ARM instruction set used to make all instructions conditional to avoid the need for branching, but has now moved away from that since branch prediction is so good nowadays.

  • rao-v 2 hours ago

    I’m amused to see a for loop in a function that is purportedly branchless

    (not a critique)

  • mshockwave 3 hours ago

    The article is easy to follow, but I think the author missed the point: branchless programming (a subset of the better-known constant-time programming) is used almost exclusively in cryptography nowadays. As the benchmarks in the article show, modern branch predictors have been able to achieve over 95%, if not 99%, accuracy for a decade or more.

  • FooBarBizBazz an hour ago

    I used to do stuff like this (ok, not half as smart), but stopped around 2013 or so, as the distinction between "implementation defined" behavior (ok) and "undefined" behavior (not ok) started to matter and bite.

    After thinking through this carefully, though, I do not see UB (except for signed overflow in a corner case):
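
    The code I'm analyzing is presumably the classic two's-complement abs, i.e.:

        int mask = x >> 31;              // step 1: arithmetic right shift
        int result = (x + mask) ^ mask;  // step 2: add, step 3: XOR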

    Step 1: bit shift.

    I understand that, until C++20, left shift of a signed int was UB. But this right shift appears to be implementation defined, which is ok if documented in the code.

    Step 2: add.

    Then, (x + mask) is defined behavior (a) for non-negative x, since then mask=0, and (b) for almost all negative x, since then mask=-1. However, if x is numeric_limits::lowest(), then you get signed integer overflow, which is UB.

    Step 3: XOR.

    Then the final XOR doesn't have UB AFAICT. It wouldn't be UB as of C++20, when signed integers became two's complement officially. It might have been implementation defined before then, which would be almost as good for something that ubiquitous, but I'm not sure.

    Ok, so I think this does not involve UB except for an input of numeric_limits::lowest().

    Sound about right?

    To fix this, perhaps you would need to make that + an unsigned one?

    It bothers me how hard you need to think to do this language lawyering. C++ is a system of rules. Computers interpret rules. The language lawyer should be a piece of software. I get it, not everything can be done statically, so, fine, do it dynamically? UBSan comes closest in practice, but doesn't detect everything. I understand formally modeled versions of C and C++ have been developed commercially, but these are not open source so they effectively do not exist. It's a strange situation.

    Just the other day I considered writing something branchless and said, "nah", because of uncertainty around UB. How much performance is being left on the table around the world because of similar thinking, driven by the existence of UB?

    Maybe I was supposed to write OCaml or Pascal or Rust.

  • vdupras an hour ago

    In the part about "abs", there's an assembly breakdown:

        mov eax, edi
        sar eax, 31
        mov ecx, eax
        add edi, ecx
        xor eax, edi

    Was this generated by a C compiler? If so, it's a bit puzzling: can't you remove "mov ecx, eax", replace "add edi, ecx" with "add edi, eax", and get the exact same result?

    • userbinator an hour ago

      If you look at compiler output, you will always see plenty of small stupidities.

      • vdupras an hour ago

        So why does conventional wisdom say that compilers will, the vast majority of the time, outperform programmers writing assembly by hand? It seems contradictory to me.