I've been working on optimizing training for long-context models (70B+) and found that while Tensor Parallelism is well-documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes.
I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.
The post covers:
The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters).
Diagrams of the specific "All-to-All" communication steps.
How to handle the KV-cache bottleneck without exploding memory.
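To make the All-to-All step concrete, here is a toy single-process simulation of the Ulysses exchange (names and shapes are illustrative, not the post's actual code): before attention, each of P ranks holds a sequence shard of shape [seq/P, heads, d]; the All-to-All redistributes it so each rank holds the full sequence for heads/P heads, which is what lets attention run unmodified per head group.

```python
import numpy as np

# Toy parameters: P ranks, seq tokens, heads attention heads, head dim d.
# In a real run this exchange is torch.distributed.all_to_all_single; here
# we simulate all P ranks locally to show the data movement.
P, seq, heads, d = 4, 8, 4, 2
rng = np.random.default_rng(0)
full = rng.standard_normal((seq, heads, d))

# Before the exchange: rank r holds sequence slice r (e.g. a Q, K, or V shard).
seq_shards = [full[r * (seq // P):(r + 1) * (seq // P)] for r in range(P)]

def ulysses_all_to_all(shards, P, heads):
    # Each rank splits its local shard along the head dim into P chunks
    # and sends chunk r to rank r. After concatenating received chunks
    # along the sequence dim, rank r holds ALL tokens for its head group.
    hp = heads // P
    out = []
    for r in range(P):
        pieces = [shards[src][:, r * hp:(r + 1) * hp, :] for src in range(P)]
        out.append(np.concatenate(pieces, axis=0))  # [seq, heads/P, d]
    return out

head_shards = ulysses_all_to_all(seq_shards, P, heads)
# Rank 0 now sees the full sequence, but only for its heads/P heads.
assert head_shards[0].shape == (seq, heads // P, d)
assert np.allclose(head_shards[0], full[:, :heads // P, :])
```

After attention, a second All-to-All in the reverse direction restores the sequence-sharded layout for the subsequent MLP block.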
Happy to answer questions about the implementation or the communication cost analysis!
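A back-of-envelope version of the communication cost analysis, with illustrative numbers rather than measurements: Ulysses exchanges Q, K, V, and the attention output, so per-rank All-to-All volume grows linearly in sequence length, while per-rank attention FLOPs grow quadratically, so the compute-to-communication ratio improves as context grows.

```python
def comm_bytes_per_rank(seq, heads, d, P, bytes_per_el=2, n_tensors=4):
    # Ulysses moves 4 tensors (Q, K, V, attention output) per layer;
    # in an All-to-All each rank ships (P-1)/P of its local shard.
    return n_tensors * (seq // P) * heads * d * bytes_per_el * (P - 1) / P

def attn_flops_per_rank(seq, heads, d, P):
    # Each rank computes full-sequence attention for heads/P heads:
    # ~2*seq^2*d FLOPs per head for QK^T plus the same for attn @ V.
    return 4 * seq * seq * d * (heads // P)

# Compute/communication ratio grows roughly linearly with seq.
for s in (65_536, 262_144, 1_048_576):
    ratio = attn_flops_per_rank(s, heads=64, d=128, P=8) / \
            comm_bytes_per_rank(s, heads=64, d=128, P=8)
    print(s, ratio)
```

This sketch ignores overlap, interconnect topology, and the MLP blocks, so treat it as a scaling argument rather than a performance model.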
This is super helpful — most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer.
Curious from your experiments: at 1M+ context, does communication start to dominate compute?
I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up being in practice.