86 GB/s bitpacking with ARM SIMD (single thread)

(github.com)

83 points | by ashtonsix 7 hours ago

19 comments

  • Retr0id 6 hours ago

    I tried to run the benchmark on my M1 Pro MacBook, but the "baseline" is written with x86 intrinsics and won't compile.

    Are the benchmark results in the README real? (The README itself feels very AI-generated)

    Looking at the makefile, it tries to link the x86 SSE "baseline" implementation and the NEON version into the same binary. A real headscratcher!

    Edit: The SSE impl gets shimmed via simd-everywhere, and the benchmark results do seem legit (aside from being slightly apples-to-oranges, but that's unavoidable)

    • ashtonsix 5 hours ago

      Thank you so much for attempting a reproduction! (I posted this on Reddit and most commenters didn't even click the link)

      For the baseline you need SIMDe headers: https://github.com/simd-everywhere/simde/tree/master/simde. These alias x86 intrinsics to ARM intrinsics. The baseline is based on the previous state of the art (https://arxiv.org/abs/1209.2137), which happens to be x86-based; using SIMDe to compile it was the highest-integrity way I could think of to compare with the previous SOTA.
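
      If it helps anyone else reproducing: the pattern looks roughly like this (a minimal sketch of how SIMDe is used, not the actual baseline source):

      ```c
      // Sketch: with native aliases enabled, SIMDe resolves x86 _mm_* calls
      // to NEON equivalents under the hood, so x86 source compiles on ARM.
      #define SIMDE_ENABLE_NATIVE_ALIASES
      #include <simde/x86/sse2.h>
      #include <stdint.h>

      void add_bytes(const uint8_t *a, const uint8_t *b, uint8_t *out) {
          __m128i va = _mm_loadu_si128((const __m128i *)a);
          __m128i vb = _mm_loadu_si128((const __m128i *)b);
          _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
      }
      ```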

      Note: M1 chips specifically have notoriously bad small-shift performance, so the benchmark results will look very bad on your machine. M3 partially fixed this; M4 fixed it completely. My primary target is server-class rather than consumer-class hardware, so I'm not too worried about this.

      The benchmark results were copy-pasted from the terminal. The README prose was AI-generated from my rough notes (I'm confident when communicating with other experts/researchers, but less so when communicating with a general audience).

      • ozgrakkurt 3 hours ago

        Super cool!

        Pretty sure anyone going into this kind of post about SIMD would prefer your writing to an LLM's

    • Asmod4n 6 hours ago
      • Retr0id 6 hours ago

        But this project isn't using simd-everywhere. I'd like to reproduce the results as documented in the README

        • guipsp 6 hours ago

          Look at the parent dir. I agree it is a bit confusing

  • danlark1 6 hours ago

    Great work!

    There's a popular narrative that NEON does not have a movemask alternative. Some time ago I published an article on simulating popular bit packing use cases with NEON in 1-2 instructions. It doesn't cover the unpacking cases, but it can be great for real-world applications like compare+find, compare+iterate, compare+test.

    https://community.arm.com/arm-community-blogs/b/servers-and-...

    • Sesse__ 4 hours ago

      I never understood why they couldn't just include a movmskb instruction to begin with. It's massively useful for integer tasks, not expensive to implement as far as I know, and the vshrn+mov trick often requires an extra instruction either in front or after (in addition to just being, well, pretty obscure).
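
      For reference, the trick in question looks something like this (a sketch of the well-known pattern, not code from the repo):

      ```c
      #include <arm_neon.h>
      #include <stdint.h>

      // The vshrn+mov "movemask" emulation: narrowing each 16-bit pair with
      // a 4-bit shift packs 4 bits per input byte into a 64-bit scalar --
      // unlike x86's 1 bit per byte, hence the extra fixup before or after.
      static inline uint64_t neon_movemask(uint8x16_t in) {
          uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(in), 4);
          return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
      }
      ```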

      NEON in general is a bit sad, really; it's built around the idea of being implementable with a 64-bit ALU, and it shows. And SVE support is pretty much non-existent on the client.

      • ashtonsix 3 hours ago

        Not having (1 << k) - 1 as a single instruction sucks when it HAS to be in a hot loop, but you can usually hoist this to the loop prologue: my stuff uses dummy inline assembly hints to force compilers to do this: `asm volatile("" : "+w"(m));`.
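
        A minimal sketch of the hoisting pattern (illustrative loop, not the actual kernel):

        ```c
        #include <arm_neon.h>
        #include <stddef.h>
        #include <stdint.h>

        // Hoist the (1 << k) - 1 mask into the loop prologue; the empty asm
        // is opaque to the optimizer, so the mask stays pinned in a vector
        // register instead of being rematerialized inside the hot loop.
        void mask_low_bits(uint8_t *dst, const uint8_t *src, size_t n, int k) {
            uint8x16_t m = vdupq_n_u8((uint8_t)((1u << k) - 1));
            asm volatile("" : "+w"(m));  // "+w": keep m in a NEON register
            for (size_t i = 0; i + 16 <= n; i += 16)
                vst1q_u8(dst + i, vandq_u8(vld1q_u8(src + i), m));
        }
        ```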

        I personally think calibrating ARM's ISA on smaller VL was a good choice: you get much better IPC. You also have an almost-complete absence of support for 8-bit elements in x86 ISAs, so elements-per-instruction is effectively tied. And NEON kind-of-ish makes up for its small VL with multi-register TBL/TBX and LDP/STP; see the sketch below.
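
        For example, a 64-entry byte table in a single lookup (a sketch, not code from the repo):

        ```c
        #include <arm_neon.h>
        #include <stdint.h>

        // 4-register TBL: 16 parallel lookups into a 64-entry byte table in
        // one instruction; out-of-range indices yield 0 (TBX keeps the old
        // destination bytes instead).
        uint8x16_t lookup64(const uint8_t table[64], uint8x16_t idx) {
            uint8x16x4_t t = vld1q_u8_x4(table);  // one 64-byte table load
            return vqtbl4q_u8(t, idx);
        }
        ```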

        Also: AVX512 is just as non-existent on clients as SVE2, although that's not really relevant for the server-side targets I'm optimising for (mostly OLAP).

        • TinkersW 3 hours ago

          8-bit is not absent in x86 SIMD; it is slightly less covered than 32- and 16-bit, but you can fully implement all the common 8-bit ops, and most are 1 instruction (with AVX2). There are even various horizontal ops on 8-bit values (avg, dot, etc.).

          Also AVX512 is way more common than SVE2, all Zen4 & Zen5 support it.

          • dzaima 2 hours ago

            More specifically, basically the only absent 8-bit ops that have 32-bit equivalents in AVX2 are shifts and multiplies. Shifts are quite annoying (though, with a uniform shift they can be emulated on AVX-512 via GFNI abuse in 1 instr), multiplies are rather rare (though note that there is vpmaddubsw for an 8-bit→16-bit multiply-add). There's even a case of the opposite - saturating add/sub exist for 8-bit and 16-bit ints, but not wider.
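
            For reference, the usual workaround for the missing uniform 8-bit shift looks like this (a sketch, assuming AVX2):

            ```c
            #include <immintrin.h>

            // AVX2 has no vpsllb; emulate a uniform 8-bit left shift with a
            // 16-bit shift plus a mask clearing the bits that leaked across
            // byte boundaries.
            static inline __m256i shl_epi8(__m256i x, int k) {
                __m256i mask = _mm256_set1_epi8((char)(0xFFu << k));
                return _mm256_and_si256(_mm256_sll_epi16(x, _mm_cvtsi32_si128(k)), mask);
            }
            ```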

        • Sesse__ an hour ago

          Does _any_ SIMD instruction set have (1 << k) - 1 as a single instruction?

          • camel-cdr 7 minutes ago

            Not sure in which context this is used, but you can do -1 << k in most ISAs; that still requires a bit-not, but if you want to use the value in a bitwise instruction, there are often variants that can invert the input operand.

            E.g. in RVV, instead of vand.vv(a, vadd.vi(vsll.vv(1, k), -1)) you could do vandn.vv(a, vsll.vv(-1, k))

            AVX-512 can do this with any binary or ternary bitwise logic function via vpternlog.
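
            A sketch of the vpternlog route (immediate 0x30 encodes the boolean function A AND NOT B):

            ```c
            #include <immintrin.h>

            // a & ((1 << k) - 1) computed as a & ~(-1 << k): one variable
            // shift plus one vpternlogd; imm 0x30 is A & ~B (the third
            // operand is ignored here).
            static inline __m512i keep_low_k(__m512i a, __m512i k) {
                __m512i m = _mm512_sllv_epi32(_mm512_set1_epi32(-1), k);  // -1 << k per lane
                return _mm512_ternarylogic_epi32(a, m, m, 0x30);
            }
            ```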

    • ashtonsix 5 hours ago

      Nice article! I personally find the ARM ISA far more cohesive than x86's: far fewer historical quirks. I also really appreciate the ubiquity of support for 8-bit elements in ARM and the absence of SMT (makes performance much more predictable).

  • fzeroff 17 minutes ago

    Very interesting! Hope you don't mind my basic question, but what would you use this for?

  • robert3005 4 hours ago

    Highly recommend https://www.vldb.org/pvldb/vol16/p2132-afroozeh.pdf for a comparable algorithm. It generalizes to arbitrary input and output bit widths.

    • ashtonsix 4 hours ago

      Good work, wish I'd seen it earlier as it overlaps with a lot of my recent work. I'm actually planning to release new SOTAs on zigzag/delta/delta-of-delta/xor-with-previous coding next week. Some areas the work doesn't give enough attention to (IMO): register pressure, kernel fusion, cache locality (wrt multi-pass). They also fudge a lot of the numbers and comparisons, e.g. pitching themselves against Parquet (storage-optimised) when Arrow (compute-optimised) is the most-comparable tech and obvious target to beat. They definitely improve on current best work, but only by a modest margin.
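
      For anyone unfamiliar, zigzag coding in its scalar form (the SIMD versions vectorise roughly this mapping):

      ```c
      #include <stdint.h>

      // Zigzag coding maps signed ints to unsigned so small magnitudes get
      // small codes (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...), which then
      // pack into few bits.
      static inline uint32_t zigzag_enc(int32_t n) { return ((uint32_t)n << 1) ^ (uint32_t)(n >> 31); }
      static inline int32_t  zigzag_dec(uint32_t z) { return (int32_t)(z >> 1) ^ -(int32_t)(z & 1); }
      ```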

      I'm also skeptical of the "unified" paradigm: performance improvements are often realised by stepping away from generalisation and exploiting the specifics of a given problem; under a unified paradigm there's definite room for improvement vs Arrow, but that's very unlikely to bring you all the way to theoretically optimal performance.