Sequence Parallelism - SGLang Documentation

Sequence parallelism splits long image or video latent sequences across GPUs. In SGLang Diffusion, the public controls are:

--sp-degree: total sequence parallel degree
--ulysses-degree: Ulysses parallel degree
--ring-degree: ring parallel degree

The degrees must satisfy:

sp_degree = ulysses_degree * ring_degree
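
The constraint above can be checked before launching a server. The following is a minimal sketch, not SGLang's own validation logic; `check_sp_degrees` is a hypothetical helper name introduced only for illustration.

```python
def check_sp_degrees(sp_degree: int, ulysses_degree: int, ring_degree: int) -> None:
    """Raise ValueError unless sp_degree == ulysses_degree * ring_degree.

    Hypothetical helper for illustration; it mirrors the documented
    constraint and is not part of the SGLang API.
    """
    degrees = {
        "sp_degree": sp_degree,
        "ulysses_degree": ulysses_degree,
        "ring_degree": ring_degree,
    }
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be a positive integer, got {value}")
    if sp_degree != ulysses_degree * ring_degree:
        raise ValueError(
            f"sp_degree={sp_degree} does not equal "
            f"ulysses_degree * ring_degree = {ulysses_degree * ring_degree}"
        )


# The two-GPU configuration used in this page: sp=2, ulysses=1, ring=2.
check_sp_degrees(sp_degree=2, ulysses_degree=1, ring_degree=2)
```

A call such as `check_sp_degrees(4, 1, 2)` raises `ValueError`, because 1 * 2 is not 4.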

Use SP when sequence length or video shape makes the DiT forward pass the bottleneck and the model supports sequence sharding. For latency-oriented multi-GPU Qwen/Wan deployments, also compare against CFG parallelism and FSDP; SP is not automatically the best multi-GPU setting for every model.

Recommended Commands

Two-GPU Sequence Parallelism

This example uses two GPUs with sp=2, ulysses=1, and ring=2.

sglang serve \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 2 \
  --sp-degree 2 \
  --ulysses-degree 1 \
  --ring-degree 2 \
  --port 8898

Single-GPU Baseline

Use an explicit single-GPU baseline before attributing a gain to sequence parallelism.

sglang serve \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 1 \
  --sp-degree 1 \
  --ulysses-degree 1 \
  --ring-degree 1 \
  --port 8898

Choosing The Degrees

Setting	Typical use	Notes
`--sp-degree 1`	Single-GPU or no sequence splitting	Use this as the baseline.
`--ulysses-degree N`	Ulysses-only sequence parallelism	When ring parallelism is not needed, keep `--ring-degree 1`.
`--ring-degree N`	Ring-based sequence splitting over long sequences	Keep `--sp-degree` equal to `ulysses_degree * ring_degree`.
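
When deciding which split to benchmark, it can help to list every (Ulysses, ring) pair that satisfies the product constraint for a given total degree. The sketch below does only that arithmetic; `degree_pairs` is a hypothetical helper, and whether a particular model supports each pair is not something this page states, so treat the output as candidates to test rather than as supported configurations.

```python
def degree_pairs(sp_degree: int) -> list[tuple[int, int]]:
    """Return all (ulysses_degree, ring_degree) pairs whose product is sp_degree.

    Hypothetical helper for illustration only.
    """
    if sp_degree < 1:
        raise ValueError(f"sp_degree must be a positive integer, got {sp_degree}")
    return [
        (ulysses, sp_degree // ulysses)
        for ulysses in range(1, sp_degree + 1)
        if sp_degree % ulysses == 0
    ]


# Print the flag combinations that satisfy the constraint for four GPUs.
for ulysses, ring in degree_pairs(4):
    print(f"--sp-degree 4 --ulysses-degree {ulysses} --ring-degree {ring}")
```

For `sp_degree=4` this yields the pairs (1, 4), (2, 2), and (4, 1).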

Benchmarking Guidance

When benchmarking SP, compare the same model, precision, resolution, frame count, step count, scheduler settings, prompt type, and output path. Report both stage latency and peak GPU memory; SP can reduce per-GPU memory while adding communication overhead. Useful metrics:

End-to-end latency
Denoising stage latency
Decoding stage latency
Peak GPU memory and peak allocated memory
Communication or runtime overhead when available
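
One way to enforce the "same settings" rule is to record each run's configuration and refuse to compare runs that differ in anything other than the parallel degrees. The sketch below is an assumption about how such a check could look; `RunConfig` and `differing_fields` are hypothetical names, and the field values are placeholders rather than the settings of any benchmark reported on this page.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Settings that should match between two runs being compared (illustrative)."""
    model: str
    precision: str
    width: int
    height: int
    num_frames: int
    num_steps: int
    scheduler: str


def differing_fields(a: RunConfig, b: RunConfig) -> list[str]:
    """Return the names of fields whose values differ between the two runs."""
    left, right = asdict(a), asdict(b)
    return [name for name in left if left[name] != right[name]]


# Placeholder values, chosen only to make the example runnable.
baseline = RunConfig("Wan-AI/Wan2.2-TI2V-5B-Diffusers", "bf16", 1280, 704, 121, 50, "default")
sp_run = RunConfig("Wan-AI/Wan2.2-TI2V-5B-Diffusers", "bf16", 1280, 704, 121, 50, "default")

mismatches = differing_fields(baseline, sp_run)
if mismatches:
    raise SystemExit(f"Runs are not comparable; differing settings: {mismatches}")
print("Runs are comparable.")
```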

Reference Benchmark

The following numbers are a reference measurement for one setup. They are not a general promise for all Wan2.2 deployments.

Model: Wan-AI/Wan2.2-TI2V-5B-Diffusers
Hardware: two 48 GB RTX 40-series GPUs for sequence parallelism, one 48 GB RTX 40-series GPU for baseline
Sequence parallel config: sp=2, ulysses=1, ring=2 (u1r2)
Baseline config: sp=1, ulysses=1, ring=1 (u1r1)

Stage Time Breakdown

Stage / Metric	`u1r2` (s)	`u1r1` baseline (s)	Speedup
InputValidation	0.1060	0.1029	0.97x
TextEncoding	1.3965	2.2261	1.59x
LatentPreparation	0.0002	0.0002	1.00x
TimestepPreparation	0.0003	0.0004	1.33x
Denoising	52.6358	71.6785	1.36x
Decoding	7.6708	13.4314	1.75x
Total	63.74	90.63	1.42x
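
The Speedup column is the baseline time divided by the sequence-parallel time, so values below 1.00x mean the stage was slower under `u1r2`. The short sketch below recomputes the column from the timings in the table; the variable and function names are illustrative.

```python
# (u1r2 seconds, u1r1 baseline seconds), copied from the table above.
stage_seconds = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(sp_seconds: float, baseline_seconds: float) -> float:
    """Speedup of the sequence-parallel run relative to the baseline."""
    return baseline_seconds / sp_seconds


for stage, (sp_seconds, baseline_seconds) in stage_seconds.items():
    print(f"{stage}: {speedup(sp_seconds, baseline_seconds):.2f}x")
```

Denoising accounts for most of the runtime in both configurations, so its 1.36x speedup has the largest effect on the 1.42x total.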

Memory Usage

Memory Metric	`u1r2` (GB)	`u1r1` baseline (GB)	Delta
Peak GPU Memory	20.07	27.40	-7.33
Peak Allocated	13.35	20.40	-7.05
Memory Overhead	6.72	7.00	-0.28
Overhead Ratio	33.5%	25.6%	+7.9pp
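
The table does not define its derived rows, but the numbers are consistent with Memory Overhead being peak GPU memory minus peak allocated memory, and Overhead Ratio being that overhead divided by peak GPU memory. The sketch below recomputes both under that assumption; `memory_overhead` is a hypothetical helper.

```python
def memory_overhead(peak_gb: float, allocated_gb: float) -> tuple[float, float]:
    """Return (overhead in GB, overhead as a fraction of peak GPU memory).

    Assumes overhead = peak - allocated, inferred from the table values.
    """
    overhead_gb = peak_gb - allocated_gb
    return overhead_gb, overhead_gb / peak_gb


# (peak GPU memory, peak allocated memory) in GB, copied from the table above.
runs = {"u1r2": (20.07, 13.35), "u1r1": (27.40, 20.40)}

for name, (peak_gb, allocated_gb) in runs.items():
    overhead_gb, ratio = memory_overhead(peak_gb, allocated_gb)
    print(f"{name}: overhead {overhead_gb:.2f} GB, ratio {ratio:.1%}")
```

Recomputing from the rounded table values reproduces 6.72 GB and 7.00 GB of overhead and a ratio of about 33.5% for `u1r2`. The baseline ratio comes out near 25.5% rather than the listed 25.6%, presumably because the table was derived from unrounded measurements.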

In this setup, end-to-end latency improved from 90.63 s to 63.74 s (1.42x) and peak GPU memory dropped by 7.33 GB. The overhead ratio rose from 25.6% to 33.5%, so tuning on the target hardware should still check communication and runtime overhead.
