Nightly Precision Regression Testing
Overview
The nightly precision regression framework detects silent numerical regressions in the SGLang serving engine by comparing per-layer hidden states between consecutive runs. It runs as a nightly CI job on 8×H200 GPUs and can also be invoked locally for development and debugging.
The framework operates on a rolling-baseline model:
- Baseline creation or comparison: Launch the server, send a fixed prompt, and dump per-layer hidden states to disk. If a previous baseline exists, compare the new tensors against it using the SGLang tensor comparator. If the comparison passes, the new tensors become the updated baseline.
- On the first run (or when the capture shape changes), the dumped tensors are saved as a new baseline with no comparison.
The HF baseline store is mandatory: the test errors at startup if SGLANG_PRECISION_HF_REPO is unset.
How It Works
Step-by-step flow
Key components
| Component | File | Purpose |
|---|---|---|
| Test entry point | test/registered/debug_utils/test_nightly_precision_regression.py | Orchestrates server launch, dump, compare, and reporting |
| HF baseline store | python/sglang/test/precision_baseline_store.py | Push / fetch / prune baselines on a HuggingFace dataset |
| Tensor comparator | python/sglang/srt/debug_utils/comparator/ | Compares two directories of .pt tensors, emits JSONL report |
| Dumper infrastructure | python/sglang/srt/debug_utils/dumper.py | Captures per-layer hidden states at runtime |
| CI workflow | .github/workflows/nightly-test-nvidia.yml | Schedules the nightly job on 8×H200 |
What Gets Dumped and Compared
Strided layer capture
Not every layer is dumped: the framework uses a strided capture to reduce I/O and storage overhead. By default, it captures:
- Layer 0 (always)
- The last layer (always)
- Every 8th layer in between (configurable via LAYER_CAPTURE_STRIDE)
The layer count is read from the model’s config.json (num_hidden_layers or num_layers). If resolution fails, all layers are captured as a safe fallback.
The dumper filter is built dynamically as a regex matching only the selected layer indices.
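A minimal sketch of the strided selection and filter construction, for illustration only. The function names and the `layers.<idx>.` tensor-name pattern are assumptions, not the actual dumper filter syntax.

```python
import re


def select_layers(num_layers: int, stride: int = 8) -> list[int]:
    """Pick layer 0, the last layer, and every `stride`-th layer in between."""
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    picked = set(range(0, num_layers, stride))  # 0, stride, 2*stride, ...
    picked.add(num_layers - 1)                  # always include the last layer
    return sorted(picked)


def build_layer_filter(layers: list[int]) -> str:
    """Build a regex that matches only the selected layer indices.

    Assumption: dumped tensor names contain `layers.<idx>.`; adapt the
    pattern to the real naming scheme.
    """
    alternation = "|".join(str(i) for i in layers)
    return rf"layers\.({alternation})\."


layers = select_layers(num_layers=20, stride=8)
pattern = re.compile(build_layer_filter(layers))
```

The trailing `\.` in the pattern keeps index 16 from also matching layer 160 in deeper models.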
Decode-path verification
The test generates 2 tokens with ignore_eos=True to ensure the model’s decode path is exercised. After the dump, _assert_decode_captured() verifies that tensors from the decode step were actually captured (not just prefill). If only prefill tensors are found, the test fails immediately. This catches misconfigurations where --max-total-tokens is too low for the decode loop to run.
Comparator
The comparator computes relative differences (rel_diff) for each tensor and checks them against a configurable threshold (default 1e-3). For tensor-parallel models, the --override-dims flag tells the comparator how to reduce across TP ranks before comparing (for example, bs h[tp:partial] for TP partial sums).
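The comparator's exact formula is not spelled out here, so the following is only a sketch under stated assumptions: rel_diff is taken as the norm of the difference divided by the norm of the baseline (one common definition), and TP partial sums are reduced by elementwise summation across ranks. Plain Python lists stand in for tensors.

```python
import math


def reduce_partial_sums(shards: list[list[float]]) -> list[float]:
    """Sum per-rank partial results elementwise to recover the full tensor."""
    return [sum(values) for values in zip(*shards)]


def rel_diff(baseline: list[float], target: list[float], eps: float = 1e-12) -> float:
    """Assumed definition: ||target - baseline|| / (||baseline|| + eps)."""
    if len(baseline) != len(target):
        raise ValueError("shape mismatch between baseline and target")
    diff_norm = math.sqrt(sum((t - b) ** 2 for b, t in zip(baseline, target)))
    base_norm = math.sqrt(sum(b * b for b in baseline))
    return diff_norm / (base_norm + eps)


THRESHOLD = 1e-3  # the documented default

# Two TP ranks, each holding a partial sum of the same hidden state.
baseline = reduce_partial_sums([[1.0, 2.0], [2.0, 2.0]])  # -> [3.0, 4.0]
target = reduce_partial_sums([[1.0, 2.0], [2.0, 2.001]])
passed = rel_diff(baseline, target) <= THRESHOLD
```

Reducing before comparing matters because individual partial sums can differ between runs while their sum stays stable.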
Capture signature
A capture_signature (SHA-1 hash of schema version, max_tokens, ignore_eos, TP size, and dumper filter) is computed per run. The HF store uses this signature during fetch to ensure only baselines with an identical capture shape are considered. If the signature changes (e.g. you add layers to the capture set or change TP), the framework establishes a fresh baseline instead of erroring on incompatible tensors.
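One way such a signature could be computed is sketched below. The field names and the canonical-JSON serialization are assumptions; only the list of hashed inputs and the use of SHA-1 come from the description above.

```python
import hashlib
import json


def capture_signature(
    schema_version: int,
    max_tokens: int,
    ignore_eos: bool,
    tp_size: int,
    dumper_filter: str,
) -> str:
    """Hash everything that determines the shape of a capture."""
    payload = json.dumps(
        {
            "schema_version": schema_version,
            "max_tokens": max_tokens,
            "ignore_eos": ignore_eos,
            "tp_size": tp_size,
            "dumper_filter": dumper_filter,
        },
        sort_keys=True,            # stable key order -> stable hash
        separators=(",", ":"),     # no whitespace variation
    )
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


sig_tp8 = capture_signature(1, 2, True, 8, r"layers\.(0|8|16|19)\.")
sig_tp4 = capture_signature(1, 2, True, 4, r"layers\.(0|8|16|19)\.")
```

Serializing with sorted keys makes the hash independent of argument or dictionary ordering, so two runs with the same capture shape always agree.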
Environment Variables
| Variable | Default | Description |
|---|---|---|
| SGLANG_PRECISION_MODELS | zai-org/GLM-5.1-FP8 | Comma-separated HuggingFace model IDs to test |
| SGLANG_PRECISION_BASELINE_DIR | /tmp/sglang_precision_baselines | Local directory for baseline tensors |
| SGLANG_PRECISION_DIFF_THRESHOLD | 1e-3 | Per-tensor relative diff threshold |
| SGLANG_PRECISION_FORCE_UPDATE | 0 | Set to 1 to skip comparison and unconditionally refresh baseline |
| SGLANG_PRECISION_COMMIT | (auto-detected from git) | Override the sglang commit SHA tagged on push |
| SGLANG_PRECISION_HF_REPO | (required) | HuggingFace dataset repo for cross-runner baseline storage |
| SGLANG_PRECISION_HF_REVISION | main | Branch/revision of the HF dataset |
| HF_TOKEN | (required in CI) | HuggingFace token with write access to the dataset |
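The table can be read as a small configuration loader. The sketch below is illustrative only: `PrecisionConfig` and `load_config` are hypothetical names, and the real test may parse these variables differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PrecisionConfig:  # hypothetical container, not part of SGLang
    models: list
    baseline_dir: str
    diff_threshold: float
    force_update: bool
    hf_repo: str
    hf_revision: str


def load_config(env=None) -> PrecisionConfig:
    """Read the documented variables, applying the documented defaults."""
    env = os.environ if env is None else env
    hf_repo = env.get("SGLANG_PRECISION_HF_REPO", "").strip()
    if not hf_repo:
        # The HF store is mandatory, so fail fast.
        raise RuntimeError("SGLANG_PRECISION_HF_REPO must be set")
    raw_models = env.get("SGLANG_PRECISION_MODELS", "zai-org/GLM-5.1-FP8")
    return PrecisionConfig(
        models=[m.strip() for m in raw_models.split(",") if m.strip()],
        baseline_dir=env.get(
            "SGLANG_PRECISION_BASELINE_DIR", "/tmp/sglang_precision_baselines"
        ),
        diff_threshold=float(env.get("SGLANG_PRECISION_DIFF_THRESHOLD", "1e-3")),
        force_update=env.get("SGLANG_PRECISION_FORCE_UPDATE", "0") == "1",
        hf_repo=hf_repo,
        hf_revision=env.get("SGLANG_PRECISION_HF_REVISION", "main"),
    )


cfg = load_config({"SGLANG_PRECISION_HF_REPO": "org/sglang-precision-baselines"})
```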
CI Integration
Workflow job
The nightly job nightly-test-precision-8-gpu-h200 is defined in .github/workflows/nightly-test-nvidia.yml and runs on an 8-GPU H200 runner. It is included in the nightly suite via test/run_suite.py.
Key CI configuration:
Required GitHub secrets/variables
| Name | Type | Purpose |
|---|---|---|
| SGLANG_PRECISION_HF_REPO | Repository variable | HF dataset repo ID (e.g. org/sglang-precision-baselines); required, the test errors if unset |
| SGLANG_PRECISION_HF_REVISION | Repository variable (optional) | Dataset branch (defaults to main) |
| HF_TOKEN_PRECISION_STORE | Repository secret | HF token with write access to the dataset |
GitHub Step Summary
When running in CI, the test writes a Markdown table to the GitHub Actions job summary showing each model’s status (PASSED, FAILED, BASELINE_ESTABLISHED, or ERROR).
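GitHub Actions exposes the job summary as a file whose path is in the GITHUB_STEP_SUMMARY environment variable. A minimal sketch of writing a status table to it follows; the function names and the two-column layout are assumptions, not the test's actual output format.

```python
import os


def render_summary(results: dict) -> str:
    """Render {model: status} as a Markdown table."""
    lines = ["| Model | Status |", "|---|---|"]
    lines += [f"| {model} | {status} |" for model, status in results.items()]
    return "\n".join(lines) + "\n"


def write_step_summary(results: dict) -> bool:
    """Append the table to the job summary; no-op outside GitHub Actions."""
    path = os.environ.get("GITHUB_STEP_SUMMARY")
    if not path:
        return False
    with open(path, "a", encoding="utf-8") as f:
        f.write(render_summary(results))
    return True


table = render_summary({"zai-org/GLM-5.1-FP8": "PASSED"})
```

Appending rather than overwriting lets several steps of the same job contribute to one summary.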
HF Dataset Storage Layout
Baselines are organized in the HF dataset as:
manifest.jsonl tracks all runs with one JSON object per line. Each manifest row carries a capture_signature field so that fetch selects only baselines with a matching capture shape.
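Signature-matched selection from the manifest could look like the sketch below. Apart from capture_signature, the row fields (`timestamp`, `run_id`) are hypothetical, as is the choice to skip rows labeled pass_label="failed"; timestamps are assumed to be ISO-8601 strings in a single format so that they sort lexicographically.

```python
import json


def latest_matching_baseline(manifest_text: str, signature: str):
    """Return the newest manifest row with a matching capture_signature."""
    candidates = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        if row.get("capture_signature") != signature:
            continue  # different capture shape, not comparable
        if row.get("pass_label") == "failed":
            continue  # assumption: failed runs are diagnostics, not baselines
        candidates.append(row)
    return max(candidates, key=lambda r: r["timestamp"], default=None)


manifest = "\n".join(
    [
        '{"run_id": "a", "timestamp": "2025-01-01T00:00:00Z", "capture_signature": "s1"}',
        '{"run_id": "b", "timestamp": "2025-01-02T00:00:00Z", "capture_signature": "s2"}',
        '{"run_id": "c", "timestamp": "2025-01-03T00:00:00Z", "capture_signature": "s1"}',
        '{"run_id": "d", "timestamp": "2025-01-04T00:00:00Z", "capture_signature": "s1", "pass_label": "failed"}',
    ]
)
best = latest_matching_baseline(manifest, "s1")
```

Returning None when nothing matches corresponds to the case where a fresh baseline is established.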
The prune_old_runs() function (callable manually) retains daily runs for 30 days and keeps one run per week beyond that window.
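The retention policy can be illustrated with a small date-based sketch. This is not the implementation of prune_old_runs(); in particular, keeping the newest run of each ISO week beyond the window is an assumption about how "one run per week" is chosen.

```python
from datetime import date, timedelta


def runs_to_keep(run_dates: list, today: date, daily_window_days: int = 30) -> set:
    """Keep every run inside the daily window, then one run per ISO week."""
    cutoff = today - timedelta(days=daily_window_days)
    keep = set()
    seen_weeks = set()
    for d in sorted(run_dates, reverse=True):  # newest first
        if d >= cutoff:
            keep.add(d)  # inside the 30-day window: keep all
            continue
        week = tuple(d.isocalendar())[:2]  # (ISO year, ISO week number)
        if week not in seen_weeks:
            seen_weeks.add(week)
            keep.add(d)  # newest run of this older week
    return keep


today = date(2025, 3, 31)
runs = [
    date(2025, 3, 30),
    date(2025, 3, 1),
    date(2025, 2, 28),
    date(2025, 2, 27),
    date(2025, 2, 20),
]
kept = runs_to_keep(runs, today)
```

Everything in `runs` that is not in `kept` would be a candidate for deletion.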
How to Add a New Model
Option A: Add to the default model list (CI)
Edit the default in test/registered/debug_utils/test_nightly_precision_regression.py:
Alternatively, set the SGLANG_PRECISION_MODELS environment variable in the CI workflow to override the default.
Option B: Run locally for a specific model
Step-by-step: adding a model to the nightly CI
- Verify the model works with the dumper. Run locally first to ensure hidden states are captured correctly.
- Run a comparison pass (remove FORCE_UPDATE). This should report PASSED if the engine is numerically stable for the model.
- Set the tensor-parallelism size. The test harness defaults to tp_size=8 for all models. If the model needs a different TP size, modify the ModelLaunchSettings construction in the test or pass extra server arguments.
- Adjust the diff threshold if needed. FP8 or quantized models may exhibit larger numerical differences. Set SGLANG_PRECISION_DIFF_THRESHOLD to an appropriate value (e.g., 1e-2 for FP8).
- Add to the default model list or configure SGLANG_PRECISION_MODELS in the CI workflow.
Considerations for model-specific adjustments
| Concern | How to handle |
|---|---|
| TP size != 8 | Override tp_size in ModelLaunchSettings or add model-specific logic |
| Quantized models (FP8, GPTQ) | Loosen SGLANG_PRECISION_DIFF_THRESHOLD (e.g., 1e-2) |
| Model needs extra server args | Pass them via ModelLaunchSettings(model, extra_args=["--quantization", "fp8"]) |
| Model needs different prompt | Modify PROMPT constant or make it model-configurable |
| MoE models with TP partial sums | Already handled by --override-dims (bs h[tp:partial]) |
| Fewer/more capture layers | Adjust LAYER_CAPTURE_STRIDE (default 8); set lower for smaller models |
| Decode not captured | Ensure --max-total-tokens is well above the scheduler’s decode reservation (default 512); the test uses 4096 |
Running Locally
Prerequisites
- SGLang installed in development mode
- GPUs matching the model’s requirements
- huggingface_hub installed
- A HuggingFace dataset for baseline storage and a write-capable HF_TOKEN. The HF store is mandatory: SGLANG_PRECISION_HF_REPO must be set or the test will error at startup. This is because the nightly CI runners are ephemeral (no persistent local disk), so baselines must survive across runs via the HF dataset. There is currently no local-only fallback.
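Before launching a local run, the prerequisites above can be checked with a small preflight script. This is a hypothetical helper, not part of the test harness; it only inspects the environment and whether huggingface_hub is importable.

```python
import importlib.util
import os


def preflight_problems(env=None) -> list:
    """Return human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("SGLANG_PRECISION_HF_REPO"):
        problems.append("SGLANG_PRECISION_HF_REPO is not set (HF store is mandatory)")
    if not env.get("HF_TOKEN"):
        problems.append("HF_TOKEN is not set (write access to the dataset is needed)")
    if importlib.util.find_spec("huggingface_hub") is None:
        problems.append("huggingface_hub is not installed")
    return problems


if __name__ == "__main__":
    for problem in preflight_problems():
        print(f"preflight: {problem}")
```

It does not verify that the token actually has write access; that only surfaces when the first push is attempted.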
Quick local test
Force-refresh a baseline
Interpreting Results
Status codes
| Status | Meaning |
|---|---|
| BASELINE_ESTABLISHED | No prior baseline with a matching signature existed; today’s tensors saved as the new baseline |
| PASSED | All per-layer hidden states are within the diff threshold; baseline updated |
| FAILED | One or more layers exceeded the diff threshold, or 0 layers were compared (baseline/target mismatch); diagnostic data pushed to HF |
| ERROR | Server launch, inference, or comparison encountered an unexpected error |
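The decision logic implied by the table can be sketched as follows. `classify_run` is a hypothetical helper; ERROR is not returned here because it corresponds to an exception raised during launch, inference, or comparison.

```python
def classify_run(baseline_found: bool, rel_diffs: dict, threshold: float = 1e-3) -> str:
    """Map one comparison outcome to a documented status code.

    rel_diffs maps tensor name -> relative diff for every compared tensor.
    """
    if not baseline_found:
        return "BASELINE_ESTABLISHED"
    if not rel_diffs:
        # Zero compared layers means baseline and target did not line up.
        return "FAILED"
    # `not (d <= threshold)` also treats NaN diffs as failures.
    if any(not (d <= threshold) for d in rel_diffs.values()):
        return "FAILED"
    return "PASSED"


status = classify_run(True, {"layers.0": 2e-4, "layers.8": 5e-4})
```

Treating an empty comparison as a failure avoids reporting PASSED when nothing was actually checked.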
Output example
When a failure is detected
- The comparator output is saved to /tmp/nightly_precision_<model>_*.log
- The failing tensors and comparator report are pushed to the HF dataset with pass_label="failed" for offline diagnosis
- The GitHub Step Summary includes the failure details
- The CI job exits with a non-zero status
Baseline Management
Local baselines
Baselines are stored under SGLANG_PRECISION_BASELINE_DIR (default /tmp/sglang_precision_baselines).
A baseline_meta.json file next to the tensors records the timestamp and commit that produced the baseline.
HF dataset baselines
- Fetch: At test start, if no local baseline exists, the latest signature-matched baseline is downloaded from the HF dataset.
- Push: After each run, tensors and metadata are uploaded to the dataset.
- Prune: Use prune_old_runs() to garbage-collect old baselines (keeps 30 days of daily runs, one per week after that).
Refreshing a stale baseline
If an intentional numerical change (e.g., kernel optimization, model refactor) causes a comparison failure:
- Verify the change is intentional
- Set SGLANG_PRECISION_FORCE_UPDATE=1 and run the test once to establish a new baseline
- Commit any necessary threshold adjustments
If the capture configuration itself changes (e.g., the captured layers or the TP size), the capture_signature will differ and the framework automatically establishes a fresh baseline; no manual intervention is needed.
Known Limitations
Baseline drift
The framework uses a rolling baseline: every successful comparison updates the baseline to the current run’s tensors. This means the reference shifts forward each day. While individual day-to-day diffs stay within the configured threshold, tiny numerical differences can accumulate over time, causing the baseline to silently drift away from the original golden values.
Implications:
- The framework detects regressions (a sudden, large numerical change between consecutive runs), not absolute accuracy relative to a fixed reference.
- Over weeks or months, the cumulative drift may become significant enough to mask a real regression that happened gradually, or to cause a false-positive failure when the drift eventually crosses the threshold.
Possible mitigations:
- Periodically re-establish a fresh anchor baseline from a known-good reference commit.
- Track the cumulative drift in the manifest metadata and alert when it exceeds a long-term budget.
- Compare against a fixed “epoch” baseline in addition to the rolling one.
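The drift effect is easy to reproduce with a toy simulation. The numbers below are illustrative, not measurements: a single activation value moves by 0.05% per day, so every rolling comparison passes the 1e-3 threshold, while the difference against a fixed epoch baseline keeps growing.

```python
def simulate_drift(daily_rel_step: float, days: int, threshold: float = 1e-3) -> list:
    """Return (day, rolling_check_passed, diff_vs_epoch) for each simulated day."""
    epoch = 1.0      # value at the fixed "epoch" baseline
    rolling = epoch  # value at the rolling baseline
    history = []
    for day in range(1, days + 1):
        current = rolling * (1.0 + daily_rel_step)  # today's slightly shifted value
        rolling_diff = abs(current - rolling) / abs(rolling)
        epoch_diff = abs(current - epoch) / abs(epoch)
        history.append((day, rolling_diff <= threshold, epoch_diff))
        rolling = current  # a pass moves the rolling baseline forward
    return history


history = simulate_drift(daily_rel_step=5e-4, days=30)
all_rolling_passed = all(passed for _, passed, _ in history)
final_epoch_diff = history[-1][2]
```

In this toy setup every nightly check passes, yet the value is already more than 1e-3 away from the epoch baseline on day 2 and more than 1e-2 away after 30 days, which is the gap an additional fixed-baseline comparison would expose.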
No local-only mode
The test requires a HuggingFace dataset (SGLANG_PRECISION_HF_REPO) and a write-capable HF_TOKEN. There is no local-only fallback. This is by design — CI runners have no persistent local disk, so the HF dataset is the only way to carry baselines across runs. If you need to run the test locally, you must set up a HF dataset (even a private one) and provide the corresponding token.
File Reference
| File | Role |
|---|---|
| test/registered/debug_utils/test_nightly_precision_regression.py | Main test: server lifecycle, dump, compare, report |
| python/sglang/test/precision_baseline_store.py | HF dataset store: push, fetch, prune baselines |
| python/sglang/srt/debug_utils/comparator/ | Tensor comparison engine |
| python/sglang/srt/debug_utils/dumper.py | Runtime hidden-state capture |
| .github/workflows/nightly-test-nvidia.yml | CI workflow definition |
| test/run_suite.py | Test suite registration (includes nightly-precision-8-gpu-h200) |
