I've seen this over and over. One of the main issues pointed out by TFA is that there are too many small tasks allocated for parallel execution. Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.
If your algorithm is the equivalent of a couple of nested loops, you have essentially three options: parallelize the outer loop, the inner loop, or both. In the vast majority of the cases I've run into, you want thread/task-level parallelism on the outer loop (only) and, if required, data/SIMD parallelism on the inner loop(s).
It's a rule of thumb, but it biases towards batches of work assigned to CPUs for a decent amount of time, allowing cache locality and pipelining to kick in. That's even before SIMD.
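For illustration, a minimal sketch of that shape with rayon (the function and data layout here are made up, not the article's code): thread-level parallelism on the outer loop only, with a sequential, contiguous inner loop that the compiler is free to vectorize.

```rust
use rayon::prelude::*;

// Hypothetical example: scale each row of a matrix by a per-row factor.
// One rayon task per row (outer loop); the inner loop is a plain sequential
// pass over contiguous memory, which is cache-friendly and auto-vectorizable.
fn scale_rows(rows: &mut [Vec<f32>], factors: &[f32]) {
    rows.par_iter_mut()
        .zip(factors.par_iter())
        .for_each(|(row, &factor)| {
            for x in row.iter_mut() {
                *x *= factor;
            }
        });
}
```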
> One of the main issues pointed out by TFA is that there's too many small tasks allocated for parallel execution
Valid concern, but I don't think this was the case in the OP?
From my understanding, the author gained the most benefit by dumbing the generic rayon implementation down to the same kind of thing (a thread pool with task queues) but with a different work-stealing algorithm.
> Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.
Work stealing by definition kind of implies that distributing the work "correctly" is a difficult task, doesn't it?
Well, sure, in practice correct distribution is difficult, but in theory work stealing is there to repair an incorrect work distribution, right?
If every CPU is 100% utilized without needing context switches (and you're running the right number of worker threads, without switching those), then work stealing is not required.
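For context, the "repair" step in a typical work-stealing scheduler looks roughly like this. A minimal sketch based on the crossbeam-deque crate (which rayon itself builds on), not the article's custom scheduler: a worker only resorts to stealing once its own queue is empty.

```rust
use std::iter;
use crossbeam_deque::{Injector, Stealer, Worker};

// Pop from the local queue first; only when it's empty, try the shared
// injector, then raid the other workers' queues.
fn find_task<T>(
    local: &Worker<T>,
    global: &Injector<T>,
    stealers: &[Stealer<T>],
) -> Option<T> {
    local.pop().or_else(|| {
        iter::repeat_with(|| {
            // Try a batch from the global queue, then single tasks from peers.
            global
                .steal_batch_and_pop(local)
                .or_else(|| stealers.iter().map(|s| s.steal()).collect())
        })
        // Keep retrying while any steal attempt lost a race (Steal::Retry).
        .find(|s| !s.is_retry())
        .and_then(|s| s.success())
    })
}
```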
But my comment is solidly a rule of thumb. I claim no theoretical basis other than "giving fewer, longer tasks to fewer threads (though still at least as many tasks as worker threads) is better than giving more, shorter ones".
The rule of thumb also keeps you from doing a lot of task switching. It isn't free to enqueue and dequeue tasks. If you have a million things to do, it is better to batch them into a smaller set of tasks, especially if the runtimes of those tasks are somewhat uniform.
For sure. Context switching tasks is certainly a lot cheaper than context switching threads, but it isn't free.
A larger grain size is better.
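As a rough illustration of grain size with rayon (the function is made up and the chunk size is arbitrary): instead of one task per element, hand each task a coarse chunk so the enqueue/dequeue/steal overhead is paid rarely.

```rust
use rayon::prelude::*;

// Coarse-grained version: one rayon task per 4096-element chunk rather than
// one task per element; each task does enough work to amortize scheduling.
fn sum_squares(data: &[f64]) -> f64 {
    data.par_chunks(4096)
        .map(|chunk| chunk.iter().map(|x| x * x).sum::<f64>())
        .sum()
}
```

rayon's `with_min_len` on indexed parallel iterators gets a similar effect without manual chunking.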
I'm confused by the before/after graphs; this is how I'm reading them:
The "before" graph shows a ~300ms wall-clock time which drops to ~150ms parallelised.
The "after" graph shows a ~27ms wall-clock time which drops to ~17ms when parallelised.
Isn't all the improvement still outside of the parallelisation then? There's an awful lot of discussion about parallelisation and the behaviour of schedulers, given that there wasn't any improvement in how parallelisable the end result was.
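For what it's worth, taking those readings of the graphs at face value, the arithmetic supports this: nearly all of the gain is in the single-threaded work, and the parallel speedup actually shrank slightly.

$$
\frac{300\ \text{ms}}{27\ \text{ms}} \approx 11\times \ \text{(single-threaded improvement)}, \qquad
\frac{300}{150} = 2.0\times \ \text{before} \ \ \text{vs.} \ \ \frac{27}{17} \approx 1.6\times \ \text{after} \ \text{(parallel speedup)}
$$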
The reported benefit of using the custom threading implementation over rayon was 20%, according to the article. So not nothing, but not the biggest win. If they were able to rejig the algorithm so they could parallelize the outer loop, there's probably a bigger win to be had.
Per the article's "Using the right profiling tools", I have gone through a similar profiling journey in Rust and found this tool very useful: https://superluminal.eu/rust/
Works on Windows, has a great GUI, shows pretty flame charts, really pleasant to use. Costs money but worth it.
> Superluminal is the new industry-standard profiling tool.
Never heard of it, and I do a lot of performance-related work. Is Superluminal really that popular?
I don't think it's popular, but it really worked great for me!
Sadly it's only for Windows, though.
Superluminal is a sampling profiler for the most part. It works great for what it does, sure. But in the author's own words:
> So far, we’ve only used perf to record stack traces at a regular time interval. This is useful, but only scratching the surface.
For cache hits and other counters, you're gonna have to go deeper than just sampling.
How would you go about measuring cache hits instead of using a sampling profiler? Would you use eBPF / BCC?
The article describes using strace and perf in the paragraph after.
I don't understand why he's running it under Docker when one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine.
I don't think that's a commonly touted benefit of Rust.
You certainly can build a statically linked binary with musl libc (in many circumstances, at least), but it's not the default.
The default is a binary that statically links all the Rust crates you pull in, but dynamically links glibc, the vdso, and several other common dependencies. It's also, IIRC, the default to dynamically link many other common dependencies like openssl if you pull them in, although I think it's common for the crates that wrap them to offer static linking as an option.
Quite a lot of things can get by without needing any dynamic libraries other than libc; there are some alternatives to using openssl in Rust that are fairly common. There definitely are places I've seen containers used for building/deploying Rust due to the need for dynamically linked dependencies, but I've also seen it a bit in places where they already use containers for everything else, and using the same thing for all of their services, regardless of whether they're in Rust, Node, or Python, can make the process around deployment simpler rather than having a separate way of doing things for each language.
They explain it in a different blog post:
> Now, you may wonder why profiling Rust code within Docker, an engine to run containerized applications. The main advantage of containers is that when configured properly, they provide isolation from the rest of your system, allowing to restrict access to resources (files, networking, memory, etc.).
https://gendignoux.com/blog/2019/11/09/profiling-rust-docker...
I've always heard that containers don't actually provide that much by way of security, although I imagine they could provide some value as one of many layers of defense in depth. Is what I've heard correct here, or am I maybe missing some nuance?
I think there's a somewhat fine distinction to be made, here. An analogy is chroot. It's not a "security tool," and there are many ways to break out of a chroot jail, but it does provide some isolation, and that does provide a measure of security. If you run an application inside a container, and you have that container well-configured, it's going to be harder for someone with an RCE on that application to affect the rest of your system than if you weren't running that application in a container.
But a container does not provide a security boundary in the same sense as the security boundary between kernel mode and user mode.
Are you sure? I think it does provide a security boundary in a way very analogous to kernel/user mode. There are lots of ways to punch holes in that boundary, but by default Docker provides, as far as I understand, a security boundary from the rest of the system and from other containers.
The isolation the author wants is not for security, but just to get a reproducible benchmarking environment
> one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine
Is it? I've heard that about Go, not so much Rust. cargo doesn't statically link system dependencies by default, only other Rust dependencies.
Go statically links C deps? Like gtk?
I think the point is Go doesn't have C deps by default like Rust does.
As I've found out trying to load a golang-built .so in an Alpine Python environment, Go makes many assumptions about the system it's running on, and one of them seems to be "if linux, then glibc". Something to do with environment variables being at a magical offset somewhere that caused a null dereference.
Maybe it's been fixed since, but I wouldn't assume Go doesn't have any C dependencies.
I thought it depended on libc by default?
You might find https://mt165.co.uk/blog/static-link-go/ useful, but a quick tl;dr:
Go binaries are statically linked by default if they do not call out to any C functions.
Go will automatically switch to dynamic linking if you need to call out to C functions. Some parts of the standard library do this internally, which is why many folks assume that Go is dynamically linked by default.
Because this is a production-like workflow in many companies, and running in Docker can have a significant impact on performance measurements.
He's running Docker to make it easier to clean up and return to a clean slate. Profiling in itself doesn't require Docker; I think it needlessly adds to the complexity of the task, unless it's providing clear benefits.
Except if the destination machine is an entirely different architecture or doesn’t include the same instructions as the compilation machine…
Docker doesn't solve that problem
I didn't claim it did. I probably could have been clearer, but there is no way that a touted benefit of Rust is being able to just drop a binary on a different machine.
It does. Docker will by default run images built for other architectures under emulation.