Why those particular integer multiplies?

(fgiesen.wordpress.com)

67 points | by luu 13 hours ago ago

31 comments

  • RaisingSpear 7 hours ago

    I suspect Intel uses 32x32b multipliers instead of his theorised 16x16b ones, just with only one for every second lane. That lines up more closely with VPMULLQ, and it seems odd that PMULUDQ would be one uOp vs PMULLD's two.

    PMULLD is probably just doing 2x PMULUDQ and discarding the high bits.
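
    Incidentally, that decomposition is also how 32-bit mullo was typically emulated before PMULLD existed; a minimal sketch with SSE2 intrinsics (my illustration, not from the post):

    ```
    #include <emmintrin.h>  // SSE2

    // 32x32->32 low multiply built from PMULUDQ (_mm_mul_epu32), the
    // widening multiply that only uses the even 32-bit lanes.
    static __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
        __m128i even = _mm_mul_epu32(a, b);                      // products of lanes 0 and 2
        __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                     _mm_srli_si128(b, 4));      // products of lanes 1 and 3
        // Keep the low 32 bits of each 64-bit product and re-interleave.
        return _mm_unpacklo_epi32(_mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
                                  _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
    }
    ```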

    (I tried commenting on his blog but it's awaiting moderation - I don't know if that's ever checked, or just sits in the queue forever)

    • anonymoushn 2 hours ago

      Makes sense to me. I have some code that uses a lot of mullo, so I get to pay twice the latency compared to if I wanted full multiplies...

  • TekMol 9 hours ago

    How can software run on different CPUs when they support different operations?

    When you download "debian-live-12.7.0-amd64-kde.iso", all the programs in the repos support all current Intel and AMD CPUs, right? Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

    Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?

    • theresistor 9 hours ago

      > Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

      Mostly the former. Some highly optimized bits of software do the latter—they are built with multiple code paths optimized for different hardware capabilities, and select which one to use at runtime.

      > Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?

      Hypothetically yes, but in practice no for the languages you mentioned because they don't map well to things like SIMD. Some JIT-based numerical computing systems as well as JIT-based ML compilers do reap those benefits.

      • twic 6 hours ago

        It's possibly worth mentioning that Java is getting a vector API which explicitly abstracts over some of the details of SIMD, including width. You have a type Vector<T> which represents enough of some type T to fill a vector register (eg eight 32-bit numbers in a 256-bit register), operations on Vector<T> which produce another Vector<T>, and some way to break arrays up into Vectors of the right size for the platform. The API is a bit clunky, but you write code with it, the compiler performs a miracle, and efficient platform-specific vector code comes out.

        https://docs.oracle.com/en/java/javase/23/docs/api/jdk.incub...

        https://docs.oracle.com/en/java/javase/23/docs/api/jdk.incub...

        • thfuran 6 hours ago

          Though it's pretty much incubating forever, or until Valhalla, whichever comes first.

      • jsheard 8 hours ago

        .NET/C# does pretty well with SIMD for a high level language, it has portable SIMD primitives which get JITed to whatever the system supports at runtime, and they're used quite extensively throughout the stdlib so you benefit even if you're not writing SIMD routines yourself.

        They tried to do something similar in Javascript but it added way too much complexity to the runtimes and ended up getting dropped in favor of WASM SIMD.

    • Conscat 3 hours ago

      I recently implemented runtime support for `__builtin_cpu_init()`, `__builtin_cpu_supports()`, and `__builtin_cpu_is()` on x86-64. Using these compiler intrinsics, or a higher-level feature such as `[[gnu::cpu_dispatch]]`, you can write functions that behave differently on different CPUs. Fortunately the implementation isn't terribly complex: on x86 it's based around the `cpuid` instruction, and other ISAs have similar features.

      https://github.com/Cons-Cat/libCat/blob/main/src%2Flibraries...
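
      For illustration, a minimal dispatch sketch on top of those builtins (GCC/Clang; the `sum_*` functions are hypothetical implementations living elsewhere):

      ```
      // Resolve once, e.g. at startup, based on what the CPU reports.
      long sum_avx2(const int* v, long n);   // hypothetical AVX2 path
      long sum_sse2(const int* v, long n);   // hypothetical baseline path

      using sum_fn = long (*)(const int*, long);

      sum_fn resolve_sum() {
          __builtin_cpu_init();                     // populate the feature cache
          if (__builtin_cpu_supports("avx2"))
              return sum_avx2;
          return sum_sse2;                          // SSE2 is the x86-64 baseline
      }
      ```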

    • TinkersW 8 hours ago

      SSE2 is a requirement for x86-64, which gives at least a reasonable baseline (128-bit wide SIMD).

      SSE4 is from 2008, so making it a requirement isn't unreasonable.

      Even AVX2 is from 2013, so some apps require it nowadays.

      It is extremely difficult for a compiler to convert scalar code to SIMD automatically; even static C++ compilers really suck at it.

      A dynamic compiler for javascript would have no real hope of any meaningful gains.

      • Sesse__ 7 hours ago

        The problem is that there were CPUs made well after 2008 that don't support SSE4. In particular, Phenom II was fairly popular, sold up until 2012, and doesn't even support SSSE3 (much less SSE4.1 and SSE4.2; only an AMD-specific variant known as SSE4a).

        • Narishma 7 hours ago

          The early Atoms too only supported up to SSSE3.

      • gus_massa 8 hours ago

        I still have an old Samsung from approximately 2008. The battery lasts like 10 minutes, a few keys are dead, and the fan makes a weird sound, so it's 99.9% retired. I still use it every few years when I need an old version of MS Office.

      • adgjlsfhk1 8 hours ago

        The other thing about AVX2 is that, because of the timing, it effectively gives you FMA as well.

    • sorenjan 5 hours ago

      > Or do they somehow adapt to the operations supported by the user's CPU?

      This is called runtime dispatch. You can do it manually or use a library, like Google Highway. GCC supports multiversioning where you write separate versions of a function and the right one is selected at runtime.

      https://github.com/google/highway

      https://gcc.gnu.org/onlinedocs/gcc-9.1.0/gcc/Function-Multiv...
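
      A minimal sketch of the GCC multiversioning route, where you write the versions by hand and the compiler emits a resolver that picks one at load time (the function itself is just an illustration):

      ```
      #include <cstddef>

      __attribute__((target("default")))
      float dot(const float* a, const float* b, std::size_t n) {
          float s = 0;
          for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
          return s;
      }

      __attribute__((target("avx2")))
      float dot(const float* a, const float* b, std::size_t n) {
          // Same scalar body; the compiler is free to auto-vectorize this
          // version with AVX2 since the target attribute permits it.
          float s = 0;
          for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
          return s;
      }
      ```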

    • jsheard 8 hours ago

      Others gave you the general answer, but in OP's line of work they just manually rewrite and tune all of the core algorithms a dozen times for different CPU architectures and dispatch to the most suitable one at runtime. I don't have a link to hand, but IIRC they go a step beyond dispatching based on CPU features, and dispatch different code paths for CPUs with the same features but significantly different instruction costs.

      RAD's codecs are expensive, but that's the expertise you're paying for.

      • anonymoushn 2 hours ago

        A recent example of the "feature detection vs. specific CPUs with different costs for the same features" problem is PEXT on Zen 2. It's implemented in microcode, and the implementation is so slow that we'd honestly be better off if the chips reported that they did not have the feature.
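
        In practice that means checking the CPU model on top of the feature bit; a sketch using the GCC CPU-detection builtins (assuming the znver* model names the builtin accepts):

        ```
        // Take the PEXT-based path only when BMI2 is present AND we're not on
        // Zen 1/2, where PEXT/PDEP are microcoded and very slow.
        bool use_pext_path() {
            __builtin_cpu_init();
            return __builtin_cpu_supports("bmi2")
                && !__builtin_cpu_is("znver1")
                && !__builtin_cpu_is("znver2");
        }
        ```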

    • Denvercoder9 8 hours ago

      > Do they just target the lowest common denominator of operations? Or do they somehow adapt to the operations supported by the user's CPU?

      Mostly the former, some specialized software does the latter. The lowest common denominator is called the baseline, and it differs over time and between distributions. Debian for example still supports x86-64-v1 (the original 64-bit extension to x86), but RHEL 10 will require x86-64-v3, which includes SSE4 and AVX2 support.

    • o11c 4 hours ago

      > the lowest common denominator of operations?

      Note that in recent years the chosen LCD for some distros has changed - they're starting to target the v2 feature set rather than the original.

      See https://developers.redhat.com/blog/2021/01/05/building-red-h...

      > Do dynamic languages (Javascript, Python, PHP...) get a speed boost because they can compile just in time and use all the features of the user's CPU?

      Dynamically-typed languages can't benefit from this at all (they may include a C library that uses runtime dispatch though). Statically-typed JIT'ed languages like Java can (and you see occasional "look, Java is faster than C" benchmarks citing this), but only if you avoid classes and use only arrays. C# can do better than Java but still suffers from its Windows-centric history.

      • neonsunset 2 hours ago

        > but only if you avoid classes and use only arrays

        Please do look into the kind of codegen emitted by OpenJDK and .NET before assuming this. It's a bit difficult with -XX:+PrintAssembly and much easier with DOTNET_JitDisasm='pattern'/Disasmo/NativeAOT+Ghidra. Once you do, you will clearly see how the exact set of ISA extensions influences instruction selection for all sorts of operations like stack zeroing, loads/stores, loop vectorization (automatic or manual), etc. .NET has extensive intrinsics and portable SIMD APIs that use effectively static dispatch even if the path is still picked at runtime, but just once during JIT compilation.

        > still suffers from its Windows-centric history.

        This is provably wrong, especially in performance-related scenarios.

    • marcosdumay 2 hours ago

      Just to add, Debian has a nice alternatives system that can select the correct version of a library for your specific system, which is what happens for a few performance-sensitive ones.

      But yeah, it's mostly code compiled to the lowest common spec, and a bit of code with dynamic dispatching.

  • Const-me 9 hours ago

    Found a bug in the article.

    Maximum for signed bytes is +127, not +128. Minimum is correct, it's -128.

    • pornel 3 hours ago

      BTW, this asymmetry makes unary negation in C an unexpected source of Undefined Behavior.
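
      For example (an illustrative snippet; the byte case itself gets promoted to int first, so the trap bites at int width and above):

      ```
      #include <limits.h>

      void demo() {
          int x = INT_MIN;
          int y = -x;   // undefined behavior: -INT_MIN is not representable as an int
          (void)y;
      }
      ```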

    • Jerrrrrrry 5 hours ago

      You can always tell when someone counts on their fingers.

  • NooneAtAll3 11 hours ago

    > PMADDUBSW produces a word result which, in turns out, does not quite work. The problem is that multiplying unsigned by signed bytes means the individual product terms are in range [-128*255, 128*255] = [-32640,32640]. Our result is supposed to be a signed word, which means its value range is [-32768,32767]. If the two individual products are either near the negative or positive end of the possible output range, the sum overflows.

    can someone explain this to me? isn't 32640 < 32767? how's this an overflow?

    • anonymoushn 11 hours ago

      The output of the instruction is, for each 16-bit lane, the sum of two products of one i8 and one u8.

      32640 * 2 > 32767

      As an aside, the quoted section of the article seems to have an error. The maximum value of an i8 is 127 and the maximum value of one of these products is 32385.
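
      A scalar model of one output lane makes the numbers concrete (my sketch; note that PMADDUBSW saturates the pairwise sum rather than wrapping):

      ```
      #include <algorithm>
      #include <cstdint>
      #include <cstdio>

      // One 16-bit lane of PMADDUBSW: two u8*i8 products, summed with signed saturation.
      int16_t pmaddubsw_lane(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
          int32_t sum = int32_t(a0) * b0 + int32_t(a1) * b1;
          return int16_t(std::clamp(sum, -32768, 32767));
      }

      int main() {
          // 255 * 127 = 32385 per product; 32385 + 32385 = 64770 does not fit in
          // an i16, so the lane saturates to 32767.
          std::printf("%d\n", pmaddubsw_lane(255, 127, 255, 127));
      }
      ```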

  • secondcoming 4 hours ago

    It's a shame that SIMD is still a dark art. I've looked at writing a few simple algorithms with it, but I have to do it in my own time as it'd be difficult to justify to my employer. I do know that gcc is generally terrible at auto-vectorising code; clang is much better but far from perfect. Using intrinsics directly will just lead to code that's unmaintainable by others not versed in the dark art, and even wrappers over intrinsics don't help much here. I feel there's a lot of efficiency being left on the table because these instructions aren't being used more.

    • Miniminix 2 hours ago

      Re: SIMD

      Suggest you look at the Julia Language: high-level, but still capable of C-like speed.

      It has built in support for SIMD (and GPU) processing.

      Julia is designed to support Scientific Computing, with a growing library spanning different domains.

      https://docs.julialang.org/en/v1/

    • Sesse__ 4 hours ago

      The problem is that the different SIMD instruction sets are genuinely... different. The basics of “8-bit unsigned add” and similar are possible to abstract over, but for a lot of cases, you may have to switch your entire algorithm around between different CPUs to get reasonable performance (or even gain over the scalar code at all). There's no way a compiler or SIMD abstraction library will do that for you.

  • wruza 10 hours ago

    Maybe it’s just me this morning, but for some reason it was a very hard read for a text about CPU instructions. Feels like it loads you with details for ages.

    • flohofwoe 8 hours ago

      New to ryg blog posts? :)

      • wruza 3 hours ago

        Not sure what was so wrong with that or why people like it so much, but yeah.