Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)#

This page reports Ring-SP performance for Wan2.2-TI2V-5B-Diffusers using:

  • Parallel config: sp=2, ulysses=1, ring=2 (short: u1r2)

  • Baseline config: sp=1, ulysses=1, ring=1 (short: u1r1)

Benchmark Setup#

  • Model: Wan2.2-TI2V-5B-Diffusers

  • GPU: 48G RTX40 series * 2

Online Serving#

Ring SP (u1r2)#

sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
  --port 8898

Baseline (u1r1)#

sglang serve \
  --model-type diffusion \
  --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
  --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
  --port 8898

Benchmarks#

Benchmark Disclaimer#

These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.

Stage Time Breakdown#

Stage / Metric

u1r2 (s)

u1r1 baseline (s)

Speedup

InputValidation

0.1060

0.1029

0.97x

TextEncoding

1.3965

2.2261

1.59x

LatentPreparation

0.0002

0.0002

1.00x

TimestepPreparation

0.0003

0.0004

1.33x

Denoising

52.6358

71.6785

1.36x

Decoding

7.6708

13.4314

1.75x

Total

63.74

90.63

1.42x

Memory Usage#

Memory Metric

u1r2 (GB)

u1r1 baseline (GB)

Delta

Peak GPU Memory

20.07

27.40

-7.33

Peak Allocated

13.35

20.40

-7.05

Memory Overhead

6.72

7.00

-0.28

Overhead Ratio

33.5%

25.6%

+7.9pp

Summary#

  • End-to-end latency improves from 90.63s to 63.74s (1.42x).

  • Main gains come from Denoising (1.36x) and Decoding (1.75x).

  • Absolute memory usage drops noticeably on Ring-SP (Peak GPU Memory -7.33GB, Peak Allocated -7.05GB).

  • Overhead ratio rises (+7.9pp), so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.