Design Goals
Quantization code should keep the quantization format semantics separate from hardware-specific execution. This makes it easier to add new formats, reuse kernels across formats, and review platform-specific changes independently.
Follow the architecture proposed in Quantization Modifications:
- Config: parses model and runtime quantization parameters, validates supported options, and selects the proper scheme.
- Scheme: owns quantized weight creation, weight loading, post-processing, and quantized layer wiring for Linear, MoE, embedding, or other module types.
- Backend kernel: wraps hardware-specific execution, layout conversion, kernel selection, and kernel calls for GPU (CUDA/HIP/XPU), NPU, or other backends.
Place each quantization method under python/sglang/srt/layers/quantization/<method>/ and split schemes into schemes/.
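The Config / Scheme / Backend kernel split described above can be sketched in a few lines. This is a toy illustration, not the project's implementation: every name in it (ToyQuantConfig, ToyLinearScheme, reference_matmul_kernel, KERNELS) is hypothetical, the "kernel" is plain Python, and the quantization is a simple symmetric per-tensor scheme chosen only to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names in this sketch are hypothetical and only illustrate the layering.


@dataclass(frozen=True)
class ToyQuantConfig:
    """Config layer: parse and validate parameters, then select a scheme."""

    bits: int
    group_size: int

    @classmethod
    def from_dict(cls, raw: Dict[str, int]) -> "ToyQuantConfig":
        bits = raw.get("bits", 4)
        if bits not in (4, 8):
            raise ValueError(f"unsupported bits: {bits}")
        return cls(bits=bits, group_size=raw.get("group_size", 128))

    def get_linear_scheme(self, backend: str) -> "ToyLinearScheme":
        # The config picks the scheme and hands it a backend kernel.
        return ToyLinearScheme(self, KERNELS[backend])


def reference_matmul_kernel(
    x: List[float], w_q: List[List[int]], scale: float
) -> List[float]:
    """Backend kernel layer: pure-Python stand-in for a hardware kernel."""
    return [sum(xi * wq * scale for xi, wq in zip(x, row)) for row in w_q]


# Registry of backend kernels; a real one would hold CUDA, HIP, XPU, or NPU paths.
KERNELS: Dict[str, Callable] = {"reference": reference_matmul_kernel}


class ToyLinearScheme:
    """Scheme layer: owns weight creation and loading, delegates execution."""

    def __init__(self, config: ToyQuantConfig, kernel: Callable) -> None:
        self.config = config
        self.kernel = kernel
        self.w_q: List[List[int]] = []
        self.scale = 1.0

    def load_weights(self, weight: List[List[float]]) -> None:
        # Symmetric per-tensor quantization, used here only for brevity.
        qmax = 2 ** (self.config.bits - 1) - 1
        max_abs = max(abs(v) for row in weight for v in row) or 1.0
        self.scale = max_abs / qmax
        self.w_q = [[round(v / self.scale) for v in row] for row in weight]

    def apply(self, x: List[float]) -> List[float]:
        # The scheme never touches hardware details; the kernel does.
        return self.kernel(x, self.w_q, self.scale)
```

In this arrangement, supporting a new backend means adding an entry to the kernel registry, and supporting a new format means adding a scheme class; neither change needs to touch the other layer.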
Recommended File Layout
Use this layout for a quantization method that has multiple schemes or backend-specific execution paths:
Adding or Refactoring a Quantization Method
- Define the config entry point and register it through python/sglang/srt/layers/quantization/__init__.py when needed.
- Add explicit scheme selection helpers such as get_linear_scheme and get_moe_scheme.
- Move layer-specific weight creation and weight loading into scheme classes.
- Move GPU (CUDA/HIP/XPU), NPU, or other hardware kernel calls into backend kernel modules.
- Keep Linear, MoE, embedding, and non-linear module handling explicit. Do not assign a Linear quantization method to a module type that needs different semantics.
- Preserve compatibility for existing quantized checkpoints and runtime flags.
- Add tests that cover both config parsing and execution paths touched by the change.
Reference PRs:
- PR #21126: splits AWQ schemes, weight initialization, and backend kernel calls.
- PR #26402: applies the same scheme/kernel split to GPTQ.
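As a rough illustration of the explicit scheme selection asked for in the steps above, the sketch below dispatches on module type and refuses to fall back to the Linear scheme for anything else. Only the helper names get_linear_scheme and get_moe_scheme come from this guide; their signatures, the layer classes, the scheme classes, the get_scheme dispatcher, and the ignored_prefixes option are hypothetical stand-ins.

```python
from typing import Optional, Sequence


# Hypothetical stand-ins for real layer classes.
class LinearLayer: ...
class MoELayer: ...
class EmbeddingLayer: ...


# Hypothetical scheme classes; real ones would create and load weights.
class ToyLinearScheme:
    kind = "linear"


class ToyMoEScheme:
    kind = "moe"


class ToyConfig:
    """Config with one selection helper per module type."""

    def __init__(self, ignored_prefixes: Sequence[str] = ()) -> None:
        self.ignored_prefixes = tuple(ignored_prefixes)

    def _is_ignored(self, prefix: str) -> bool:
        return any(prefix.startswith(p) for p in self.ignored_prefixes)

    def get_linear_scheme(self, prefix: str) -> Optional[ToyLinearScheme]:
        # Returning None means "leave this layer unquantized".
        return None if self._is_ignored(prefix) else ToyLinearScheme()

    def get_moe_scheme(self, prefix: str) -> Optional[ToyMoEScheme]:
        return None if self._is_ignored(prefix) else ToyMoEScheme()

    def get_scheme(self, layer: object, prefix: str):
        if isinstance(layer, LinearLayer):
            return self.get_linear_scheme(prefix)
        if isinstance(layer, MoELayer):
            return self.get_moe_scheme(prefix)
        # No silent fallback: other module types never get the Linear scheme.
        return None
```

A config-parsing test can then assert on the selected scheme per module type, separately from any execution test that runs the backend kernel.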
Tests and Validation
Quantization changes can affect both accuracy and performance. Include validation that matches the blast radius of the change.
For Python-only structure changes:
- Launch at least one representative model for each touched quantization method.
- Send a /generate request and confirm the output path succeeds.
- Run an accuracy sanity test if the change can affect numerics.
- Include warmup-aware benchmark results when the change affects kernel calls, layout conversion, or dispatch.
- Validate GPU changes on a supported GPU environment (NVIDIA, AMD, or Intel).
- Validate NPU changes on a supported Ascend environment.
- Include the exact model, quantization flag, backend flag, hardware, and command used in the PR description.
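The /generate check in the list above could be scripted along these lines, using only the Python standard library. Treat it as a sketch under assumptions: the payload fields ("text", "sampling_params"), the "text" field in the response, and the example address are assumptions to verify against the server version under test, and both function names are made up for this example.

```python
import json
import urllib.request


def build_generate_request(
    base_url: str, prompt: str, max_new_tokens: int = 16
) -> urllib.request.Request:
    """Build a POST request for the /generate endpoint.

    Assumption: the server accepts a JSON body with "text" and
    "sampling_params"; adjust the payload if your version differs.
    """
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(
    base_url: str, prompt: str = "The capital of France is", timeout: float = 60.0
) -> str:
    """Send one request and fail loudly if no text comes back."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    # Assumption: the response is a JSON object with a "text" field.
    text = body["text"]
    if not text.strip():
        raise RuntimeError("server returned empty output")
    return text
```

With a server already launched for the quantized model, calling smoke_test("http://localhost:30000") (an example address) exercises the output path once. Recording that call, together with the launch command, covers the "exact command" item for the PR description.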
PR Checklist
Before requesting review, make sure the PR description includes:
- The quantization method and backend paths changed.
- The issue, design proposal, or roadmap item the PR follows.
- Any compatibility notes for existing checkpoints or flags.
- Accuracy results when model outputs can change.
- Benchmark or profiling results when runtime performance can change.
- The exact local checks and model launch tests that were run.
