I've been working on optimizing training for long-context models (70B+) and found that while Tensor Parallelism is well-documented, the newer "Unified" Sequence Parallelism techniques (like DeepSpeed Ulysses) are often treated as black boxes.
I wrote this deep dive to visualize exactly how we shard the Q, K, V projections and how the All-to-All communication primitives work during the attention step to handle 1M+ tokens.
The post covers:
The architectural difference between Ring Attention and Ulysses (and why Ulysses often wins on H100 clusters).
Diagrams of the specific "All-to-All" communication steps.
How to handle the KV-cache bottleneck without exploding memory.
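To make the All-to-All step concrete, here is a toy single-process simulation of the Ulysses exchange (names and shapes are illustrative, not the post's actual code): before attention, each of P ranks holds a sequence shard of shape [seq/P, heads, d]; the All-to-All redistributes it so each rank holds the full sequence for heads/P heads, which is what lets attention run unmodified per head group.

```python
import numpy as np

# Toy parameters: P ranks, seq tokens, heads attention heads, head dim d.
# In a real run this exchange is torch.distributed.all_to_all_single; here
# we simulate all P ranks locally to show the data movement.
P, seq, heads, d = 4, 8, 4, 2
rng = np.random.default_rng(0)
full = rng.standard_normal((seq, heads, d))

# Before the exchange: rank r holds sequence slice r (e.g. a Q, K, or V shard).
seq_shards = [full[r * (seq // P):(r + 1) * (seq // P)] for r in range(P)]

def ulysses_all_to_all(shards, P, heads):
    # Each rank splits its local shard along the head dim into P chunks
    # and sends chunk r to rank r. After concatenating received chunks
    # along the sequence dim, rank r holds ALL tokens for its head group.
    hp = heads // P
    out = []
    for r in range(P):
        pieces = [shards[src][:, r * hp:(r + 1) * hp, :] for src in range(P)]
        out.append(np.concatenate(pieces, axis=0))  # [seq, heads/P, d]
    return out

head_shards = ulysses_all_to_all(seq_shards, P, heads)
# Rank 0 now sees the full sequence, but only for its heads/P heads.
assert head_shards[0].shape == (seq, heads // P, d)
assert np.allclose(head_shards[0], full[:, :heads // P, :])
```

After attention, a second All-to-All in the reverse direction restores the sequence-sharded layout for the subsequent MLP block.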
Happy to answer questions about the implementation or the communication cost analysis!
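A back-of-envelope version of the communication cost analysis, with illustrative numbers rather than measurements: Ulysses exchanges Q, K, V, and the attention output, so per-rank All-to-All volume grows linearly in sequence length, while per-rank attention FLOPs grow quadratically, so the compute-to-communication ratio improves as context grows.

```python
def comm_bytes_per_rank(seq, heads, d, P, bytes_per_el=2, n_tensors=4):
    # Ulysses moves 4 tensors (Q, K, V, attention output) per layer;
    # in an All-to-All each rank ships (P-1)/P of its local shard.
    return n_tensors * (seq // P) * heads * d * bytes_per_el * (P - 1) / P

def attn_flops_per_rank(seq, heads, d, P):
    # Each rank computes full-sequence attention for heads/P heads:
    # ~2*seq^2*d FLOPs per head for QK^T plus the same for attn @ V.
    return 4 * seq * seq * d * (heads // P)

# Compute/communication ratio grows roughly linearly with seq.
for s in (65_536, 262_144, 1_048_576):
    ratio = attn_flops_per_rank(s, heads=64, d=128, P=8) / \
            comm_bytes_per_rank(s, heads=64, d=128, P=8)
    print(s, ratio)
```

This sketch ignores overlap, interconnect topology, and the MLP blocks, so treat it as a scaling argument rather than a performance model.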
This is super helpful — most writeups skip over the actual communication steps, so seeing the All-to-All flow laid out makes it much clearer.
Curious from your experiments: at 1M+ context, does communication start to dominate compute?
I keep seeing cases where bigger context windows are technically possible but don’t translate into better results unless the context is very structured, so I wonder where the real scaling limit ends up being in practice.