Ring SP Benchmark: Wan2.2-TI2V-5B (u1r2 vs Baseline)#
This page reports Ring-SP performance for Wan2.2-TI2V-5B-Diffusers using:
Parallel config: sp=2, ulysses=1, ring=2 (short: u1r2)
Baseline config: sp=1, ulysses=1, ring=1 (short: u1r1)
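The short names follow directly from the two sub-degrees. The sketch below is illustrative only: the `SPConfig` class is hypothetical (not part of SGLang), and it assumes the total SP degree is the product of the Ulysses and ring degrees, which holds for both configs on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SPConfig:
    """Hypothetical helper describing one sequence-parallel layout."""
    ulysses: int
    ring: int

    @property
    def sp(self) -> int:
        # Assumption: total SP degree = ulysses degree * ring degree.
        return self.ulysses * self.ring

    @property
    def short(self) -> str:
        # Short label in the form used on this page, e.g. "u1r2".
        return f"u{self.ulysses}r{self.ring}"


parallel = SPConfig(ulysses=1, ring=2)
baseline = SPConfig(ulysses=1, ring=1)

for cfg in (parallel, baseline):
    print(f"{cfg.short}: sp={cfg.sp}, ulysses={cfg.ulysses}, ring={cfg.ring}")
```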
Benchmark Setup#
Model: Wan2.2-TI2V-5B-Diffusers
GPU: 2 × RTX 40-series, 48 GB each
Online Serving#
Ring SP (u1r2)#
sglang serve \
    --model-type diffusion \
    --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --num-gpus 2 --sp-degree 2 --ulysses-degree 1 --ring-degree 2 \
    --port 8898
Baseline (u1r1)#
sglang serve \
    --model-type diffusion \
    --model-path /model/HuggingFace/Wan-AI/Wan2.2-TI2V-5B-Diffusers \
    --num-gpus 1 --sp-degree 1 --ulysses-degree 1 --ring-degree 1 \
    --port 8898
Benchmarks#
Benchmark Disclaimer#
These benchmarks are provided for reference under one specific setup and command configuration. Actual performance may vary with model settings, runtime environment, and request patterns.
Stage Time Breakdown#
| Stage / Metric | u1r2 (s) | u1r1 (s) | Speedup |
|---|---|---|---|
| InputValidation | 0.1060 | 0.1029 | 0.97x |
| TextEncoding | 1.3965 | 2.2261 | 1.59x |
| LatentPreparation | 0.0002 | 0.0002 | 1.00x |
| TimestepPreparation | 0.0003 | 0.0004 | 1.33x |
| Denoising | 52.6358 | 71.6785 | 1.36x |
| Decoding | 7.6708 | 13.4314 | 1.75x |
| Total | 63.74 | 90.63 | 1.42x |

Speedup is the baseline (u1r1) time divided by the Ring SP (u1r2) time.
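The Speedup column can be reproduced from the two timing columns. This is a minimal sketch, assuming the first numeric column is Ring SP (u1r2) and the second is the baseline (u1r1), which matches the totals quoted in the Summary; the `STAGES` dictionary and `speedup` function are illustrative names.

```python
# Stage times in seconds, copied from the table: (u1r2, u1r1).
STAGES = {
    "InputValidation": (0.1060, 0.1029),
    "TextEncoding": (1.3965, 2.2261),
    "LatentPreparation": (0.0002, 0.0002),
    "TimestepPreparation": (0.0003, 0.0004),
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
    "Total": (63.74, 90.63),
}


def speedup(ring_s: float, baseline_s: float) -> float:
    """Speedup of Ring SP over the baseline (values > 1 mean Ring SP is faster)."""
    return baseline_s / ring_s


speedups = {name: round(speedup(r, b), 2) for name, (r, b) in STAGES.items()}

for name, value in speedups.items():
    print(f"{name:<20} {value:.2f}x")
```

Note that the listed stages do not sum exactly to the Total row, so the total presumably includes time outside these stages.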
Memory Usage#
| Memory Metric | u1r2 | u1r1 | Delta |
|---|---|---|---|
| Peak GPU Memory (GB) | 20.07 | 27.40 | -7.33 |
| Peak Allocated (GB) | 13.35 | 20.40 | -7.05 |
| Memory Overhead (GB) | 6.72 | 7.00 | -0.28 |
| Overhead Ratio | 33.5% | 25.6% | +7.9pp |

Delta is the u1r2 value minus the u1r1 value.
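The derived memory rows appear to follow from the two peak figures. The sketch below assumes Memory Overhead is Peak GPU Memory minus Peak Allocated and Overhead Ratio is that overhead divided by Peak GPU Memory; these definitions are inferred from the reported numbers, not stated by the source. Because the inputs are already rounded, the recomputed ratios agree with the table only to within about 0.1 percentage points.

```python
# Peak figures in GB, copied from the table: (u1r2, u1r1).
PEAK_GPU = {"u1r2": 20.07, "u1r1": 27.40}
PEAK_ALLOCATED = {"u1r2": 13.35, "u1r1": 20.40}


def overhead_gb(peak: float, allocated: float) -> float:
    # Assumed definition: memory held beyond what was actually allocated.
    return peak - allocated


def overhead_ratio(peak: float, allocated: float) -> float:
    # Assumed definition: overhead as a fraction of peak GPU memory.
    return overhead_gb(peak, allocated) / peak


ratios = {}
for cfg in ("u1r2", "u1r1"):
    ratios[cfg] = overhead_ratio(PEAK_GPU[cfg], PEAK_ALLOCATED[cfg])
    gb = overhead_gb(PEAK_GPU[cfg], PEAK_ALLOCATED[cfg])
    print(f"{cfg}: overhead {gb:.2f} GB, ratio {ratios[cfg]:.1%}")

# Difference in percentage points between Ring SP and the baseline.
delta_pp = (ratios["u1r2"] - ratios["u1r1"]) * 100
print(f"ratio delta: {delta_pp:+.1f}pp")
```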
Summary#
End-to-end latency improves from 90.63s to 63.74s (1.42x).
Main gains come from Denoising (1.36x, about 19.0s saved) and Decoding (1.75x, about 5.8s saved).
Absolute memory usage drops noticeably on Ring-SP (Peak GPU Memory -7.33GB, Peak Allocated -7.05GB).
Overhead ratio rises (+7.9pp) because memory overhead stays nearly flat (6.72GB vs 7.00GB) while peak memory shrinks, so future tuning can focus on reducing communication/runtime overhead while preserving the latency gain.
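To put the 1.42x figure in context, the sketch below derives two numbers that are not reported on this page: parallel efficiency (speedup divided by GPU count, a standard scaling metric) and each major stage's share of the total time saved. All variable names are illustrative, and the inputs are the rounded values from the tables above.

```python
NUM_GPUS = 2  # Ring SP run uses two GPUs; the baseline uses one.

# Times in seconds: (u1r2, u1r1).
TOTAL = (63.74, 90.63)
MAJOR_STAGES = {
    "Denoising": (52.6358, 71.6785),
    "Decoding": (7.6708, 13.4314),
}

total_saved = TOTAL[1] - TOTAL[0]
total_speedup = TOTAL[1] / TOTAL[0]

# Parallel efficiency: 1.0 would mean perfect linear scaling.
efficiency = total_speedup / NUM_GPUS
print(f"speedup {total_speedup:.2f}x on {NUM_GPUS} GPUs, efficiency {efficiency:.0%}")

# Share of the end-to-end saving attributable to each major stage.
shares = {}
for name, (ring_s, base_s) in MAJOR_STAGES.items():
    saved = base_s - ring_s
    shares[name] = saved / total_saved
    print(f"{name}: {saved:.2f}s saved, {shares[name]:.0%} of total saving")
```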