# Sequence Parallelism

Sequence parallelism splits long image or video latent sequences across GPUs. In SGLang Diffusion, the public controls are:

* `--sp-degree`: total sequence parallel degree
* `--ulysses-degree`: Ulysses parallel degree
* `--ring-degree`: ring parallel degree

The degrees must satisfy:

```text
sp_degree = ulysses_degree * ring_degree
```

Use SP when sequence length or video shape makes the DiT forward pass the bottleneck and the model supports sequence sharding. For latency-oriented multi-GPU Qwen/Wan deployments, also compare against CFG parallelism and FSDP; SP is not automatically the best multi-GPU setting for every model.

## Recommended Commands

### Two-GPU Sequence Parallelism

This example uses two GPUs with `sp=2`, `ulysses=1`, and `ring=2`.

```bash
sglang serve \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 2 \
  --sp-degree 2 \
  --ulysses-degree 1 \
  --ring-degree 2 \
  --port 8898
```

### Single-GPU Baseline

Use an explicit single-GPU baseline before attributing a gain to sequence parallelism.

```bash
sglang serve \
  --model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 1 \
  --sp-degree 1 \
  --ulysses-degree 1 \
  --ring-degree 1 \
  --port 8898
```

## Choosing The Degrees

| Setting | Typical use | Notes |
| --- | --- | --- |
| `--sp-degree 1` | Single-GPU or no sequence splitting | Use this as the baseline. |
| `--ulysses-degree N` | Ulysses-only sequence parallelism | When ring parallelism is not needed, keep `--ring-degree 1`. |
| `--ring-degree N` | Ring-based sequence splitting over long sequences | Keep `--sp-degree` equal to `ulysses_degree * ring_degree`. |

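The product rule `sp_degree = ulysses_degree * ring_degree` means a given SP degree admits only certain Ulysses/ring splits. As a minimal sketch, the Python below enumerates the splits that satisfy the rule and formats the matching flags. The helper names `degree_pairs` and `sp_flags` are hypothetical and not part of SGLang, and the sketch checks arithmetic only: whether a particular model or attention backend accepts every pair is not something it tests.

```python
def degree_pairs(sp_degree: int) -> list[tuple[int, int]]:
    """Return every (ulysses_degree, ring_degree) pair whose product is sp_degree."""
    if sp_degree < 1:
        raise ValueError("sp_degree must be a positive integer")
    return [(u, sp_degree // u) for u in range(1, sp_degree + 1) if sp_degree % u == 0]


def sp_flags(ulysses_degree: int, ring_degree: int) -> list[str]:
    """Build the three flags, deriving --sp-degree from the product (hypothetical helper)."""
    sp_degree = ulysses_degree * ring_degree
    return [
        "--sp-degree", str(sp_degree),
        "--ulysses-degree", str(ulysses_degree),
        "--ring-degree", str(ring_degree),
    ]


# List the candidate configurations for a total SP degree of 4.
for ulysses, ring in degree_pairs(4):
    print(" ".join(sp_flags(ulysses, ring)))
```

Deriving `--sp-degree` from the other two values, rather than typing all three by hand, keeps the three flags from drifting out of sync when sweeping configurations.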
## Benchmarking Guidance

When benchmarking SP, compare the same model, precision, resolution, frame count, step count, scheduler settings, prompt type, and output path. Report both stage latency and peak GPU memory; SP can reduce per-GPU memory while adding communication overhead.

Useful metrics:

* End-to-end latency
* Denoising stage latency
* Decoding stage latency
* Peak GPU memory and peak allocated memory
* Communication or runtime overhead when available

## Reference Benchmark

The following numbers are a reference measurement for one setup. They are not a general promise for all Wan2.2 deployments.

* Model: `Wan-AI/Wan2.2-TI2V-5B-Diffusers`
* Hardware: two 48 GB RTX 40-series GPUs for sequence parallelism, one 48 GB RTX 40-series GPU for baseline
* Sequence parallel config: `sp=2, ulysses=1, ring=2` (`u1r2`)
* Baseline config: `sp=1, ulysses=1, ring=1` (`u1r1`)

### Stage Time Breakdown

| Stage / Metric | `u1r2` (s) | `u1r1` baseline (s) | Speedup |
| --- | --- | --- | --- |
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| Total | 63.74 | 90.63 | 1.42x |

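The Speedup column is consistent with dividing the baseline time by the `u1r2` time, so values above 1 mean the sequence-parallel run was faster. A small sketch that recomputes the column from the stage times in the table (the `STAGE_SECONDS` dictionary and the `speedup` function are illustrative names, not SGLang APIs):

```python
# Stage times in seconds from the table above, as (u1r2, u1r1 baseline).
STAGE_SECONDS = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(sp_seconds: float, baseline_seconds: float) -> float:
    """Baseline time divided by SP time; values above 1 mean SP is faster."""
    return baseline_seconds / sp_seconds


# Print one ratio per stage, rounded to two decimals like the table.
for stage, (sp_s, base_s) in STAGE_SECONDS.items():
    print(f"{stage:<20} {speedup(sp_s, base_s):.2f}x")
```

Denoising accounts for most of the wall time in both runs, so its ratio largely determines the end-to-end result; the sub-millisecond stages are too short for their ratios to be meaningful.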
### Memory Usage

| Memory Metric | `u1r2` (GB) | `u1r1` baseline (GB) | Delta |
| --- | --- | --- | --- |
| Peak GPU Memory | 20.07 | 27.40 | -7.33 |
| Peak Allocated | 13.35 | 20.40 | -7.05 |
| Memory Overhead | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |

In this setup, end-to-end latency improved from `90.63s` to `63.74s` (`1.42x`) and peak GPU memory dropped by `7.33GB`. The overhead ratio rose from `25.6%` to `33.5%`, so tuning on the target hardware should still measure communication and runtime overhead.
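The Memory Overhead and Overhead Ratio rows appear to follow from the other two rows: overhead as peak GPU memory minus peak allocated memory, and the ratio as overhead divided by peak GPU memory. That reading is inferred from the reported numbers rather than stated by the source, and `memory_overhead` below is a hypothetical helper used only to illustrate it. Recomputing from the rounded table values gives 33.5% for `u1r2` and about 25.5% for the baseline, so the reported 25.6% was presumably derived from unrounded measurements.

```python
def memory_overhead(peak_gpu_gb: float, peak_allocated_gb: float) -> tuple[float, float]:
    """Return (overhead in GB, overhead as a fraction of peak GPU memory)."""
    overhead_gb = peak_gpu_gb - peak_allocated_gb
    return overhead_gb, overhead_gb / peak_gpu_gb


# (peak GPU memory, peak allocated) in GB, taken from the Memory Usage table.
runs = {"u1r2": (20.07, 13.35), "u1r1": (27.40, 20.40)}

for name, (peak, allocated) in runs.items():
    overhead_gb, ratio = memory_overhead(peak, allocated)
    print(f"{name}: overhead {overhead_gb:.2f} GB, ratio {ratio:.1%}")
```

Because the ratio is normalized by peak GPU memory, it can rise even when the absolute overhead falls, which is what happens here: overhead drops by 0.28 GB while peak memory drops by 7.33 GB.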