> SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
> The potential speed improvement vs JAX for large training runs is over an order of magnitude
C is next to godliness.
Full tweet:
> SpaceX has almost finished writing V1.0 of an in-house AI training stack in C that exact-maps to 220k GB300s with 800G NICs, making heavy use of pipeline parallelism and getting as close to bare metal as possible.
> The potential speed improvement vs JAX for large training runs is over an order of magnitude