I feel like the title is a bit misleading. I think it should be something like "Using Rust's Standard Library from the GPU". The stdlib code doesn't execute on the GPU, it is just a remote function call, executed on the CPU, and then the response is returned. Very neat, but not the same as executing on the GPU itself as the title implies.
> For example, std::time::Instant is implemented on the GPU using a device timer
The code is running on the gpu there. It looks like remote calls are only for "IO", the compiled stdlib is generally running on gpu. (Going just from the post, haven't looked at any details)
I'm surprised this article doesn't provide a bigger list of calls that run on the gpu and further examples of what needs some cpu interop.
Flip on the pedantic switch. We have std::fs, std::time, some of std::io, and std::net(!). While the `libc` calls go to the host, all the `std` code in-between runs on the GPU.
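To make that concrete, here is a sketch of the kind of code that claim covers. It is just ordinary std Rust; the kernel attribute and launch plumbing are left out because they're backend-specific, and the file path is an arbitrary example.

use std::io::Write;
use std::time::Instant;

// Ordinary std code: per the post, Instant is backed by a device timer, while the
// file and stdout operations bottom out in libc-style calls serviced by the host.
fn device_side_work() -> std::io::Result<()> {
    let start = Instant::now();

    let contents = std::fs::read_to_string("/etc/hostname")?;
    writeln!(std::io::stdout(), "host says: {}", contents.trim())?;

    writeln!(std::io::stdout(), "took {:?}", start.elapsed())?;
    Ok(())
}

fn main() {
    // Calling it from main here only shows it is plain Rust; on the GPU the same
    // body would be invoked as a kernel.
    device_side_work().unwrap();
}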
I think it fits quite well. Kind of like how the Rust standard lib runs on the CPU, this does partially run on the GPU. The post does say they fall back on syscalls, but for others there are native calls on the GPU itself, such as Instant. It's the same way the standard lib uses syscalls on the CPU instead of doing everything in-process.
Author here! Flip on the pedantic switch, we agree ;-)
Are there any details around how the round-trip and exchange of data (CPU<->GPU) is implemented in order to not be a big (partially-hidden) performance hit?
e.g. this code seems like it would entirely run on the CPU?
print!("Enter your name: ");
let _ = std::io::stdout().flush();
let mut name = String::new();
std::io::stdin().read_line(&mut name).unwrap();
But what if we concatenated a number that was calculated on the GPU to the string, or if we take a number:
print!("Enter a number: ");
[...] // the string has to be converted to a float and sent to the GPU
// Some calculations with that number performed on the GPU
print!("The result is: {}", the_result); // Number needs to be sent back to the CPU
Or maybe I am misunderstanding how this is supposed to work?
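For comparison, here is the flow from that question written as a traditional CPU-side program, with a made-up `gpu_compute` standing in for the kernel launch. It only illustrates where the string/float conversions and the CPU<->GPU hops would sit; it is not how the project structures this.

use std::io::{self, Write};

// Hypothetical stand-in for a kernel launch: in reality this is where the f32
// would be copied to the device and the result copied back.
fn gpu_compute(x: f32) -> f32 {
    x * x
}

fn main() {
    print!("Enter a number: ");
    let _ = io::stdout().flush();

    let mut line = String::new();
    io::stdin().read_line(&mut line).unwrap();

    // String -> float happens on the CPU before the value is sent to the GPU.
    let x: f32 = line.trim().parse().unwrap();

    let the_result = gpu_compute(x);

    // The result comes back to the CPU before it is formatted and printed.
    println!("The result is: {}", the_result);
}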
"We leverage APIs like CUDA streams to avoid blocking the GPU while the host processes requests.", so I'm guessing it would let the other GPU threads go about their lives while that one waits for the ACK from the CPU.
I once wrote a prototype async IO runtime for GLSL (https://github.com/kig/glslscript), it used a shared memory buffer and spinlocks. The GPU would write "hey do this" into the IO buffer, then go about doing other stuff until it needed the results, and spinlock to wait for the results to arrive from the CPU. I remember this being a total pain, as you need to be aware of how PCIe DMA works on some level: having your spinlock int written to doesn't mean that the rest of the memory write has finished.
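A minimal sketch of the hazard described above, using plain CPU threads and Rust atomics rather than actual GPU/PCIe traffic: publishing the "done" flag has to be a release store paired with an acquire load, otherwise the reader can observe the flag before the payload it guards.

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Toy mailbox: a payload plus a flag meaning "the payload is valid now".
struct Mailbox {
    payload: UnsafeCell<u64>,
    ready: AtomicBool,
}
unsafe impl Sync for Mailbox {}

fn main() {
    let m = Arc::new(Mailbox {
        payload: UnsafeCell::new(0),
        ready: AtomicBool::new(false),
    });

    let producer = {
        let m = Arc::clone(&m);
        thread::spawn(move || {
            unsafe { *m.payload.get() = 42 };        // write the data first...
            m.ready.store(true, Ordering::Release);  // ...then publish it with Release
        })
    };

    // Consumer spins, like the GPU side described above, until the flag flips.
    while !m.ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // The Acquire/Release pairing guarantees the payload write is visible here.
    println!("got {}", unsafe { *m.payload.get() });

    producer.join().unwrap();
}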
We use the cuda device allocator for allocations on the GPU via Rust's default allocator.
Why are you assuming that this is intended to be performant, compared to code that properly segregates the CPU- and GPU-side? It seems clear to me that the latter will be a win.
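As an aside on the allocator answer above: #[global_allocator] is the hook that lets "Rust's default allocator" be pointed at something device-provided. The sketch below is generic and not the project's actual code; the extern malloc/free are placeholders for whatever the CUDA device allocator exposes.

use std::alloc::{GlobalAlloc, Layout};

// Placeholder bindings for a device-side allocator.
extern "C" {
    fn malloc(size: usize) -> *mut u8;
    fn free(ptr: *mut u8);
}

struct DeviceHeap;

unsafe impl GlobalAlloc for DeviceHeap {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Ignores over-alignment for brevity; a real allocator must honor layout.align().
        malloc(layout.size())
    }
    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        free(ptr)
    }
}

#[global_allocator]
static ALLOC: DeviceHeap = DeviceHeap;

fn main() {
    // Every ordinary allocation (Vec, String, Box, ...) now routes through DeviceHeap.
    let v: Vec<u32> = (0..4).collect();
    println!("{:?}", v);
}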
I'm confused about this: As the article outlines well, Std Rust (over core) buys you GPOS-provided things. For example:
- file system
- network interfaces
- dates/times
- Threads, e.g. for splitting across CPU cores
The main relevant one I can think of which applies is an allocator.
I do a lot of GPU work with Rust: graphics in WGPU, and CUDA kernels + cuFFT mediated by Cudarc (a thin FFI lib). I guess running the std lib on the GPU isn't something I understand. What would be cool is the dream that's been building for decades about parallel computing abstractions where you write what looks like normal single-threaded CPU code, but it automagically works on SIMD instructions or the GPU. I think this and CubeCL may be working towards that? (I'm using Burn as well on the GPU, but that's abstracted over.)
Of note: Rayon sort of is that dream for CPU thread pools!
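For anyone who hasn't used it, the Rayon point is roughly this (assuming rayon as a dependency): you swap an iterator for a parallel one and the body still reads like normal single-threaded code.

use rayon::prelude::*;

fn main() {
    // Looks like sequential code; Rayon spreads the map across a thread pool.
    let sum_of_squares: u64 = (0..1_000_000u64)
        .into_par_iter()
        .map(|x| x * x)
        .sum();
    println!("{}", sum_of_squares);
}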
The GPU shader just calls back to the CPU which executes the OS-specific function and relays the answer to the GPU side. It might not make much sense on its own to have such strong coupling, but it gives you a default behavior that makes coding easier.
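A purely conceptual sketch of that callback shape, CPU side only. The request/reply types and the dispatcher are invented for illustration; the real ABI between the GPU and the host is whatever the project defines.

use std::io::Write;

// Invented request/reply shapes, just to show the round trip.
enum HostRequest {
    StdoutWrite(Vec<u8>),
    UnixTimeNanos,
}

enum HostReply {
    Written(usize),
    Nanos(u128),
}

// CPU-side dispatcher: performs the OS-specific work and returns the answer
// that gets relayed back to the waiting GPU thread.
fn serve(req: HostRequest) -> HostReply {
    match req {
        HostRequest::StdoutWrite(bytes) => {
            let n = std::io::stdout().write(&bytes).unwrap_or(0);
            HostReply::Written(n)
        }
        HostRequest::UnixTimeNanos => {
            let now = std::time::SystemTime::now()
                .duration_since(std::time::UNIX_EPOCH)
                .unwrap();
            HostReply::Nanos(now.as_nanos())
        }
    }
}

fn main() {
    let _ = serve(HostRequest::StdoutWrite(b"hello from the device side\n".to_vec()));
    if let HostReply::Nanos(t) = serve(HostRequest::UnixTimeNanos) {
        println!("nanos since epoch: {}", t);
    }
}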
How different is it from rust-gpu effort?
UPDATE: Oh, that's a post from the maintainers of rust-gpu.
Can I execute FizzBuzz and DOOM on GPU?
Well, you've already been able to run Doom for about 6 months now [1]. I haven't tested the Nvidia side, but it ran okay on my RX 7700S in my Framework laptop.
[1] https://github.com/jhuber6/doomgeneric