I've seen this over and over. One of the main issues pointed out by TFA is that there are too many small tasks allocated for parallel execution. Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.
If your algorithm is the equivalent of a couple of nested loops, you have essentially three options: parallelize the outer loop, the inner loop, or both. In the vast majority of the cases I've run into, you want thread/task-level parallelism on the outer loop (only) and, if required, data/SIMD parallelism on the inner loop(s).
It's a rule of thumb, but it biases towards batches of work assigned to CPUs for a decent amount of time, allowing cache locality and pipelining to kick in. That's even before SIMD.
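For illustration, a minimal sketch of that shape with rayon (the function and data layout here are made up, not the article's code): thread-level parallelism on the outer loop only, with a sequential, contiguous inner loop that the compiler is free to vectorize.

```rust
use rayon::prelude::*;

// Hypothetical example: scale each row of a matrix by a per-row factor.
// One rayon task per row (outer loop); the inner loop is a plain sequential
// pass over contiguous memory, which is cache-friendly and auto-vectorizable.
fn scale_rows(rows: &mut [Vec<f32>], factors: &[f32]) {
    rows.par_iter_mut()
        .zip(factors.par_iter())
        .for_each(|(row, &factor)| {
            for x in row.iter_mut() {
                *x *= factor;
            }
        });
}
```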
> One of the main issues pointed out by TFA is that there's too many small tasks allocated for parallel execution
Valid concern, but I don't think this was the case in the OP?
From my understanding, the author gained the most benefit by dumbing the generic rayon implementation down to the same kind of thing (a thread pool with task queues) but with a different work-stealing algorithm.
> Rayon is not going to magically distribute your work perfectly, though it very often does a decent job.
Work stealing by definition kind of implies that distributing the work "correctly" is a difficult task, doesn't it?
Well, sure, in practice correct distribution is difficult, but in theory work stealing is there to repair an incorrect work distribution, right?
If every CPU is 100% utilized without needing context switches (and you're running the right number of worker threads, without switching those), then work stealing is not required.
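For context, the "repair" step in a typical work-stealing scheduler looks roughly like this. A minimal sketch based on the crossbeam-deque crate (which rayon itself builds on), not the article's custom scheduler: a worker only resorts to stealing once its own queue is empty.

```rust
use std::iter;
use crossbeam_deque::{Injector, Stealer, Worker};

// Pop from the local queue first; only when it's empty, try the shared
// injector, then raid the other workers' queues.
fn find_task<T>(
    local: &Worker<T>,
    global: &Injector<T>,
    stealers: &[Stealer<T>],
) -> Option<T> {
    local.pop().or_else(|| {
        iter::repeat_with(|| {
            // Try a batch from the global queue, then single tasks from peers.
            global
                .steal_batch_and_pop(local)
                .or_else(|| stealers.iter().map(|s| s.steal()).collect())
        })
        // Keep retrying while any steal attempt lost a race (Steal::Retry).
        .find(|s| !s.is_retry())
        .and_then(|s| s.success())
    })
}
```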
But my comment is solidly a rule of thumb. I claim no theoretical basis other than "giving fewer, longer tasks to fewer threads (though still at least as many tasks as worker threads) is better than giving more, shorter ones".
The rule of thumb also keeps you from doing a lot of task switching. It isn't free to enqueue and dequeue tasks. If you have a million things to do, it is better to batch them into a smaller set of tasks, especially if the runtimes of those tasks are somewhat uniform.
For sure. Context switching tasks is certainly a lot cheaper than context switching threads, but it isn't free.
A larger grain size is better.
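As a rough illustration of grain size with rayon (the function is made up and the chunk size is arbitrary): instead of one task per element, hand each task a coarse chunk so the enqueue/dequeue/steal overhead is paid rarely.

```rust
use rayon::prelude::*;

// Coarse-grained version: one rayon task per 4096-element chunk rather than
// one task per element; each task does enough work to amortize scheduling.
fn sum_squares(data: &[f64]) -> f64 {
    data.par_chunks(4096)
        .map(|chunk| chunk.iter().map(|x| x * x).sum::<f64>())
        .sum()
}
```

rayon's `with_min_len` on indexed parallel iterators gets a similar effect without manual chunking.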
I'm confused by the before/after graphs; this is how I'm reading them:
The "before" graph shows a ~300ms wall-clock time which drops to ~150ms parallelised.
The "after" graph shows a ~27ms wall-clock time which drops to ~17ms when parallelised.
Isn't all the improvement still outside of the parallelisation then? There's an awful lot of discussion about parallelisation and the behaviour of schedulers, given that there wasn't any improvement in how parallelisable the end result was.
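For what it's worth, taking those readings of the graphs at face value, the arithmetic supports this: nearly all of the gain is in the single-threaded work, and the parallel speedup actually shrank slightly.

$$
\frac{300\ \text{ms}}{27\ \text{ms}} \approx 11\times \ \text{(single-threaded improvement)}, \qquad
\frac{300}{150} = 2.0\times \ \text{before} \ \ \text{vs.} \ \ \frac{27}{17} \approx 1.6\times \ \text{after} \ \text{(parallel speedup)}
$$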
The reported benefit of using the custom threading implementation over rayon was 20%, according to the article. So not nothing, but not the biggest win. If they were able to rejig the algorithm so they could parallelize the outer loop, there's probably a bigger win to be had.
Per the article's "Using the right profiling tools", I have gone through a similar profiling journey in Rust and found this tool very useful: https://superluminal.eu/rust/
Works on Windows, has a great GUI, shows pretty flame charts, really pleasant to use. Costs money but worth it.
> Superluminal is the new industry-standard profiling tool.
Never heard of it, and I do a lot of performance-related work. Is Superluminal really that popular?
I don't think it's popular, but it really worked great for me!
Sadly it's only for Windows, though.
Superluminal is a sampling profiler for the most part. It works great for what it does, sure. But in the author's own words:
> So far, we’ve only used perf to record stack traces at a regular time interval. This is useful, but only scratching the surface.
For cache hits and other counters, you're gonna have to go deeper than just sampling.
How would you go about measuring cache hits instead of using a sampling profiler? Would you use eBPF / BCC?
The article describes using strace and perf in the paragraph after.
I don't understand why he's running it under Docker when one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine.
I don't think that's a commonly touted benefit of Rust.
You certainly can build a statically linked binary with musl libc (in many circumstances, at least), but it's not the default.
The default is a binary that statically links all the Rust crates you pull in, but dynamically links glibc, the vdso, and several other common dependencies. It's also, IIRC, the default to dynamically link many other common dependencies like openssl if you pull them in, although I think it's common for the crates that wrap them to offer static linking as an option.
Quite a lot of things can get by without needing any dynamic libraries other than libc; there are some alternatives to using openssl in Rust that are fairly common. There definitely are places I've seen containers used for building/deploying Rust due to the need for dynamically linked dependencies, but I've also seen it a bit in places where they already use containers for everything else, and using the same thing for all of their services, regardless of whether they're in Rust, Node, or Python, can make the process around deployment simpler rather than having a separate way of doing things for each language.
They explain it in a different blog post:
> Now, you may wonder why profiling Rust code within Docker, an engine to run containerized applications. The main advantage of containers is that when configured properly, they provide isolation from the rest of your system, allowing to restrict access to resources (files, networking, memory, etc.).
https://gendignoux.com/blog/2019/11/09/profiling-rust-docker...
I've always heard that containers don't actually provide that much by way of security, although I imagine they could provide some value as one of many layers of defense in depth. Is what I've heard correct here, or am I maybe missing some nuance?
I think there's a somewhat fine distinction to be made, here. An analogy is chroot. It's not a "security tool," and there are many ways to break out of a chroot jail, but it does provide some isolation, and that does provide a measure of security. If you run an application inside a container, and you have that container well-configured, it's going to be harder for someone with an RCE on that application to affect the rest of your system than if you weren't running that application in a container.
But a container does not provide a security boundary in the same sense as the security boundary between kernel mode and user mode.
Are you sure? I think it does provide a security boundary in a way very analogous to kernel/user mode. There are lots of ways to punch holes in that boundary, but by default Docker provides, as far as I understand, a security boundary from the rest of the system and from other containers.
The isolation the author wants is not for security, but just to get a reproducible benchmarking environment
> one of the touted benefits of Rust is that all you need to do is just copy & run the binary on the destination machine
Is it? I've heard that about Go, not so much Rust. cargo doesn't statically link system dependencies by default, only other Rust dependencies.
Go statically links C deps? Like gtk?
I think the point is Go doesn't have C deps by default like Rust does.
As I've found out trying to load a golang-built .so in an Alpine Python environment, Go makes many assumptions about the system it's running on, and one of them seems to be "if linux, then glibc". Something to do with environment variables being at a magical offset somewhere that caused a null dereference.
Maybe it's been fixed since, but I wouldn't assume Go doesn't have any C dependencies.
I thought it depended on libc by default?
You might find https://mt165.co.uk/blog/static-link-go/ useful, but a quick tl;dr:
Go binaries are statically linked by default if they do not call out to any C functions.
Go will automatically switch to dynamic linking if you need to call out to C functions. Some parts of the standard library do this internally, which is why many folks assume that Go is dynamically linked by default.
Because this is a production-like workflow in many companies, and running in Docker can have a significant impact on performance measurements.
He's running Docker to make it easier to clean up and return to a clean slate. Profiling in itself doesn't require Docker; I think it needlessly adds to the complexity of the task, unless it's providing clear benefits.
Except if the destination machine is an entirely different architecture or doesn’t include the same instructions as the compilation machine…
Docker doesn't solve that problem
I didn't claim it did. I probably could have been clearer, but there is no way that a touted benefit of Rust is being able to just drop a binary on a different machine.
It does. Docker will by default run images built for other architectures under emulation.