This page reports Ring-SP performance on Ascend NPU with torch_npu==2.10.0.
  • Baseline config: ulysses=1, ring=1 (short: u1r1)
  • Ring-SP config: ulysses=1, ring=2 (short: u1r2)

Benchmark Setup

  • Model: Wan2.1-T2V-1.3B-Diffusers
  • Prompt: "a cat is playing piano"
  • Framework command: sglang generate
  • Runtime: torch_npu==2.10.0

Generate Commands

Baseline (u1r1)

sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "a cat is playing piano" --num-gpus 1 --ring-degree 1 \
    --save-output

Ring-SP (u1r2)

sglang generate --model-path /nas/disk1/Wan2.1-T2V-1.3B-Diffusers \
    --prompt "a cat is playing piano" --num-gpus 2 --ring-degree 2 \
    --save-output
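
The two commands above differ only in `--num-gpus` and `--ring-degree`. As a minimal sketch, assuming you want to drive both runs from one script, the Python helper below builds the two argument lists from the flags shown on this page. The helper name `build_generate_command` and the constants are hypothetical and not part of sglang; the sketch only prints the commands and does not launch them.

```python
import shlex

# Values copied from the commands on this page.
MODEL_PATH = "/nas/disk1/Wan2.1-T2V-1.3B-Diffusers"
PROMPT = "a cat is playing piano"


def build_generate_command(num_gpus, ring_degree):
    """Return the argv list for one `sglang generate` run (hypothetical helper)."""
    return [
        "sglang", "generate",
        "--model-path", MODEL_PATH,
        "--prompt", PROMPT,
        "--num-gpus", str(num_gpus),
        "--ring-degree", str(ring_degree),
        "--save-output",
    ]


baseline_cmd = build_generate_command(num_gpus=1, ring_degree=1)  # u1r1
ring_cmd = build_generate_command(num_gpus=2, ring_degree=2)      # u1r2

# shlex.join quotes the prompt so the printed line can be pasted into a shell.
print(shlex.join(baseline_cmd))
print(shlex.join(ring_cmd))
```

To actually execute a run, one option is to pass the list to `subprocess.run(cmd, check=True)` on a machine where `sglang` is installed.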

Benchmarks

Benchmark Disclaimer: These numbers come from a single fixed setup and a single prompt. Actual performance may vary with model settings, environment, and workload.

Stage Time Breakdown

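| Stage / Metric               | u1r2 (s) | u1r1 baseline (s) | Speedup |
|------------------------------|----------|-------------------|---------|
| InputValidation              | 0.0003   | 0.0002            | 0.67x   |
| TextEncoding                 | 3.5936   | 3.5820            | 1.00x   |
| LatentPreparation            | 0.0007   | 0.0055            | 7.86x   |
| TimestepPreparation          | 0.0008   | 0.0007            | 0.88x   |
| Denoising                    | 121.2788 | 239.2580          | 1.97x   |
| Decoding                     | 13.8685  | 16.4969           | 1.19x   |
| Total (Pixel data generated) | 141.86   | 266.50            | 1.88x   |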
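
The Speedup column appears to be the baseline time divided by the Ring-SP time, so values below 1.00x mean the u1r2 run was slower for that stage. The Python sketch below recomputes each ratio from the timings in the table and checks it against the published value, allowing for two-decimal rounding. The dictionary and function names are illustrative only.

```python
# Stage timings in seconds, copied from the table: (u1r2, u1r1 baseline).
STAGE_TIMES = {
    "InputValidation": (0.0003, 0.0002),
    "TextEncoding": (3.5936, 3.5820),
    "LatentPreparation": (0.0007, 0.0055),
    "TimestepPreparation": (0.0008, 0.0007),
    "Denoising": (121.2788, 239.2580),
    "Decoding": (13.8685, 16.4969),
    "Total": (141.86, 266.50),
}

# Speedup column as published in the table.
PUBLISHED = {
    "InputValidation": 0.67,
    "TextEncoding": 1.00,
    "LatentPreparation": 7.86,
    "TimestepPreparation": 0.88,
    "Denoising": 1.97,
    "Decoding": 1.19,
    "Total": 1.88,
}


def speedup(ring_s, baseline_s):
    """Speedup of Ring-SP over the baseline: baseline time / Ring-SP time."""
    return baseline_s / ring_s


# A published value matches if it is within half a unit of the last
# printed digit (0.005), plus a tiny slack for floating-point error.
MATCHES = {}
for stage, (ring_s, baseline_s) in STAGE_TIMES.items():
    ratio = speedup(ring_s, baseline_s)
    MATCHES[stage] = abs(ratio - PUBLISHED[stage]) <= 0.005 + 1e-9
    status = "ok" if MATCHES[stage] else "MISMATCH"
    print(f"{stage:<20} computed {ratio:.4f}x  published {PUBLISHED[stage]:.2f}x  {status}")
```

The sub-millisecond stages (InputValidation, LatentPreparation, TimestepPreparation) produce large or small ratios from differences of well under a millisecond, so their speedup values say little about Ring-SP itself.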

Summary

  • With torch_npu==2.10.0, Ring-SP (u1r2) runs successfully on NPU for this case.
  • End-to-end generation time improves from 266.50s to 141.86s (1.88x).
  • Most of the gain comes from the Denoising stage (239.26s to 121.28s, 1.97x); Decoding also improves (16.50s to 13.87s, 1.19x).
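
To put the summary figures in context, the sketch below derives two quantities from the table: parallel efficiency (speedup divided by the number of devices, here 2 for ring=2) and each stage's share of the total time saved. These metrics and the function names are additions for illustration, not numbers reported by the benchmark itself.

```python
NUM_DEVICES = 2  # ring=2 in the u1r2 configuration

# Timings in seconds from the Stage Time Breakdown table: (u1r1 baseline, u1r2).
DENOISING = (239.2580, 121.2788)
DECODING = (16.4969, 13.8685)
TOTAL = (266.50, 141.86)


def parallel_efficiency(baseline_s, parallel_s, devices):
    """Speedup divided by device count; 1.0 would be perfect linear scaling."""
    return (baseline_s / parallel_s) / devices


def share_of_saving(stage, total):
    """Fraction of the end-to-end time saving that comes from one stage."""
    stage_saved = stage[0] - stage[1]
    total_saved = total[0] - total[1]
    return stage_saved / total_saved


denoise_eff = parallel_efficiency(*DENOISING, NUM_DEVICES)
e2e_eff = parallel_efficiency(*TOTAL, NUM_DEVICES)
denoise_share = share_of_saving(DENOISING, TOTAL)
decode_share = share_of_saving(DECODING, TOTAL)

print(f"Denoising parallel efficiency:  {denoise_eff:.1%}")
print(f"End-to-end parallel efficiency: {e2e_eff:.1%}")
print(f"Share of saving from Denoising: {denoise_share:.1%}")
print(f"Share of saving from Decoding:  {decode_share:.1%}")
```

On these numbers, Denoising scales at close to 99% efficiency across the two devices and accounts for roughly 95% of the 124.64s saved end to end, while Decoding contributes about 2%. The rest of the saving is time included in the total but not broken out in the per-stage rows.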