36 comments

  • jvanderbot 20 hours ago

    I've seen this over and over. One of the main issues pointed out by TFA is that there are too many small tasks allocated for parallel execution. Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.

    If your algorithm is the equivalent of a couple of nested iterations, you have essentially three options: parallelize outer, inner, or both. In the vast majority of the cases I've run into, you want thread/task level parallelism on the outer loop (only), and if required, data/simd parallelism on the inner loop(s).

    It's a rule of thumb, but it biases towards batches of work assigned to CPUs for a decent amount of time, allowing cache locality and pipelining to kick in. That's even before SIMD.
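
    A minimal sketch of that rule of thumb with rayon (names and data shape made up for illustration):

        use rayon::prelude::*;

        // Thread-level parallelism on the outer loop only; the inner loop
        // stays sequential, so each worker gets a cache-friendly batch that
        // the compiler is free to vectorize.
        fn row_sums(matrix: &[Vec<f32>]) -> Vec<f32> {
            matrix
                .par_iter()                         // parallelize the outer loop
                .map(|row| row.iter().sum::<f32>()) // inner loop stays sequential
                .collect()
        }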

    • menaerus 10 hours ago

      > One of the main issues pointed out by TFA is that there are too many small tasks allocated for parallel execution

      Valid concern, but I don't think this was the OP's case though?

      From my understanding, the author gained the most benefit by stripping the generic rayon implementation down to the same kind of design (a thread pool with task queues) but with a different work-stealing algorithm.

      > Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.

      Work-stealing by definition kinda makes distributing the work "correctly" a difficult task, doesn't it?

      • jvanderbot an hour ago

        Well, sure, in practice work stealing makes correct distribution difficult, but in theory, work stealing exists to repair an incorrect work distribution, right?

        If every CPU is 100% utilized without needing context switches (and you're running the right number of worker threads, without switching those), then work stealing is not required.

        But my comment is solidly "rule of thumb" territory. I claim no theoretical basis beyond "giving fewer, longer tasks to fewer threads (still >= the number of worker threads) is better than giving more, shorter ones".

    • cogman10 19 hours ago

      The rule of thumb also keeps you from doing a lot of task switching. It isn't free to enqueue and dequeue tasks. If you have a million things to do, it's better to batch them into a smaller set of tasks, especially if the runtimes of those tasks are somewhat uniform.
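
      Something like this, as a sketch (chunk size and workload made up):

          use rayon::prelude::*;

          // Instead of one task per item (a million tiny enqueues/dequeues),
          // hand the pool coarser chunks so the scheduling overhead stays small.
          fn sum_squares(items: &[u64]) -> u64 {
              items
                  .par_chunks(4096) // one task per 4096 items, not per item
                  .map(|chunk| chunk.iter().map(|&x| x * x).sum::<u64>())
                  .sum()
          }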

      • mcronce 15 hours ago

        For sure. Context switching tasks is certainly a lot cheaper than context switching threads, but it isn't free.

    • dboreham 18 hours ago

      Larger grain size better.

  • xnorswap 12 hours ago

    I'm confused by the before/after graphs; this is how I'm reading them:

    The "before" graph shows a ~300ms wall-clock time which drops to ~150ms parallelised.

    The "after" graph shows a ~27ms wall-clock time which drops to ~17ms when parallelised.

    Isn't all the improvement still outside of the parallelisation, then? There's an awful lot of discussion about parallelisation and the behaviour of schedulers, given that there wasn't any improvement in how parallelisable the end result was?

    • meheleventyone 10 hours ago

      The reported benefit of using the custom threading implementation over rayon was 20%, according to the article. So not nothing, but not the biggest win. If they were able to rejig the algorithm so they could parallelize the outer loop, there's probably a bigger win to be had.

  • ModernMech 17 hours ago

    Per the article's "Using the right profiling tools", I have gone through a similar profiling journey in Rust and found this tool very useful: https://superluminal.eu/rust/

    Works on Windows, has a great GUI, shows pretty flame charts, really pleasant to use. Costs money but worth it.

    • datadeft 6 hours ago

      > Superluminal is the new industry-standard profiling tool.

      Never heard of it, and I do a lot of performance-related work. Is Superluminal really that popular?

      • ModernMech 4 hours ago

        I don't think it's popular but it really worked great for me!

    • mgoetzke 5 hours ago

      Sadly it's only for Windows, though.

    • yazzku 16 hours ago

      Superluminal is a sampling profiler for the most part. It works great for what it does, sure. But in the author's own words:

      > So far, we’ve only used perf to record stack traces at a regular time interval. This is useful, but only scratching the surface.

      For cache hits and other counters, you're gonna have to go deeper than just sampling.

      • datadeft 6 hours ago

        How would you go about measuring cache hits instead of using a sampling profiler? Would you use eBPF / BCC?

        • freeone3000 5 hours ago

          The article describes using strace and perf in the paragraph after.
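
          Concretely, something like this (standard perf usage; the binary path is a placeholder):

              perf stat -e cache-references,cache-misses ./target/release/your-binary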

  • ciupicri 17 hours ago

    I don't understand why he's running it under Docker when one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine.

    • mcronce 15 hours ago

      I don't think that's a commonly touted benefit of Rust.

      You certainly can build a statically linked binary with musl libc (in many circumstances, at least), but it's not the default.

      The default is a binary that statically links all the Rust crates you pull in, but dynamically links glibc, the vdso, and several other common dependencies. It's also, IIRC, the default to dynamically link many other common dependencies like openssl if you pull them in, although I think it's common for the crates that wrap them to offer static linking as an option.
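
      For reference, the fully static route usually looks like this (standard rustup/cargo commands; whether it links cleanly depends on your crates):

          rustup target add x86_64-unknown-linux-musl
          cargo build --release --target x86_64-unknown-linux-musl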

      • saghm 14 hours ago

        Quite a lot of things can get by without needing any dynamic libraries other than libc; there are some fairly common alternatives to openssl in Rust. There definitely are places I've seen containers used for building/deploying Rust due to the need for dynamically linked dependencies, but I've also seen it in places where they're already using containers for everything else; using the same thing for all of their services, whether they're in Rust, Node, or Python, can make the process around deployment simpler than having a separate way of doing things for each language.

    • maxbond 17 hours ago

      They explain it in a different blog post.

      > Now, you may wonder why profiling Rust code within Docker, an engine to run containerized applications. The main advantage of containers is that when configured properly, they provide isolation from the rest of your system, allowing to restrict access to resources (files, networking, memory, etc.).

      https://gendignoux.com/blog/2019/11/09/profiling-rust-docker...
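
      As an illustration of the kind of restrictions that post means (standard docker run flags; the image name and limits are placeholders):

          docker run --rm --network none --memory 2g --cpus 4 bench-image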

      • saghm 14 hours ago

        I've always heard that containers don't actually provide that much by way of security, although I imagine they could provide some value as one of many layers of defense in depth. Is what I've heard correct, or am I maybe missing some nuance?

        • khafra 11 hours ago

          I think there's a somewhat fine distinction to be made, here. An analogy is chroot. It's not a "security tool," and there are many ways to break out of a chroot jail, but it does provide some isolation, and that does provide a measure of security. If you run an application inside a container, and you have that container well-configured, it's going to be harder for someone with an RCE on that application to affect the rest of your system than if you weren't running that application in a container.
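
          "Well-configured" means something along these lines (standard docker flags, purely as an illustration; the image name is a placeholder):

              docker run --read-only --cap-drop ALL --security-opt no-new-privileges my-image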

          But a container does not provide a security boundary in the same sense as the security boundary between kernel mode and user mode.

          • CGamesPlay 4 hours ago

            Are you sure? I think it does provide a security boundary in a way very analogous to kernel/user mode. There are lots of ways to punch holes in that security boundary, but by default Docker provides, as far as I understood, a security boundary from the rest of the system and from other containers.

        • Ar-Curunir 3 hours ago

          The isolation the author wants is not for security, but just to get a reproducible benchmarking environment.

    • duped 17 hours ago

      > one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine

      Is it? I've heard that about Go, not so much Rust. cargo doesn't statically link system dependencies by default, only other Rust dependencies.

      • 01HNNWZ0MV43FF 17 hours ago

        Go statically links C deps? Like gtk?

        • password4321 14 hours ago

          I think the point is Go doesn't have C deps by default like Rust does.

          • jeroenhd 9 hours ago

            As I've found out trying to load a golang-built .so in an Alpine Python environment, Go makes many assumptions about the system it's running on, and one of them seems to be "if linux, then glibc". Something to do with environment variables being at a magical offset somewhere that caused a null dereference.

            Maybe it's been fixed since, but I wouldn't assume Go doesn't have any C dependencies.

          • m0shen 12 hours ago

            I thought it depended on libc by default?

            • mirashii 7 hours ago

              You might find https://mt165.co.uk/blog/static-link-go/ useful, but a quick tl;dr:

              Go binaries are statically linked by default if they do not call out to any C functions.

              Go will automatically switch to dynamic linking if you need to call out to C functions. There are some parts of the standard library that do this internally, which is why many folks assume that Go is dynamically linked by default.
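
              You can force the static behaviour explicitly (standard Go toolchain usage):

                  CGO_ENABLED=0 go build ./...

              With cgo disabled, those standard library parts fall back to their pure-Go implementations (e.g. the DNS resolver), and the resulting binary is statically linked.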

    • datadeft 6 hours ago

      Because this is a production-like workflow in many companies, and running in Docker can have a significant impact on performance measurements.

    • kristianp 13 hours ago

      He's running Docker to make it easier to clean up and return to a clean slate. Profiling in itself doesn't require Docker; I think it needlessly adds to the complexity of profiling, unless it's providing clear benefits.

    • vips7L 16 hours ago

      Except if the destination machine is an entirely different architecture or doesn’t include the same instructions as the compilation machine…

      • duped 16 hours ago

        Docker doesn't solve that problem

        • vips7L 5 hours ago

          I didn't claim it did. I probably could have been clearer, but there's no way that a touted benefit of Rust is being able to just drop a binary on a different machine.

        • dymk 12 hours ago

          It does. Docker can transparently run images built for other architectures under QEMU emulation (Docker Desktop enables this by default).
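
          For example (standard Docker usage; alpine is just an arbitrary multi-arch image):

              docker run --rm --platform linux/arm64 alpine uname -m
              # prints aarch64 even on an x86_64 host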