HiSparse: Hierarchical Sparse Attention

HiSparse reduces per-request GPU memory consumption during the decode phase by maintaining only a small “hot” KV buffer on GPU while keeping complete KV data in CPU pinned memory. Combined with PD disaggregation, it enables significantly higher decode concurrency.

Prerequisites: HiSparse works with models that use DeepSeek Sparse Attention (DSA) architectures (e.g., DeepSeek-V3.2, GLM-5.1) and DeepSeek V4. These models natively select a subset of tokens for attention, making it possible to keep only the top-k KV on GPU while storing the full KV in host memory — without accuracy loss. Additionally, HiSparse currently requires PD disaggregation mode and is enabled on the decode instance only.

Why HiSparse?

In long-context LLM inference, each decoding request holds a full-length KV cache on GPU, limiting the number of concurrent requests a decode instance can serve. HiSparse addresses this by:

Reducing GPU memory per request: Each request occupies only a fixed-size device buffer (e.g., 4KB tokens) instead of the full sequence length.
On-demand swap-in: A CUDA kernel dynamically loads the top-k most relevant KV entries from host memory based on attention scores.
Transparent to prefill: HiSparse is entirely a decode-side optimization; the prefill instance requires no changes.

Design Overview

Decode Workflow

Each decode step follows this flow:

Forward decode — generate the next token
Top-k selection — select the most relevant token positions via attention scores
Swap-in — the CUDA kernel loads top-k KV entries from host to device buffer:
- Short sequences (seq_len ≤ device_buffer_size): fast path, all KV already in buffer
- Long sequences: hit detection → LRU reordering → miss handling (host → device copy)
Decode attention — compute attention using the top-k device locations
Eager backup — asynchronously copy the previous token’s KV from device to host

PD Disaggregation Integration (Direct-to-Host)

In PD disaggregation mode, the prefill instance transfers KV cache directly into the decode instance’s host pool via RDMA, bypassing the GPU entirely on the decode side. This eliminates the transient GPU memory spike during KV transfer and removes the staging DMA step.

Prefill GPU  ──RDMA──▶  Decode Host Pool (CPU pinned memory)
                              │
                              ▼
                     alloc device buffer (4KB)
                              │
                              ▼
                     swap-in kernel (on-demand top-k)

For DeepSeek V4, the direct-to-host path writes only C4 KV into the decode host pool. The c4_indexer and C128 KV remain device-to-device transfers.

Server Arguments

Argument	Type / Default	Description
`—enable-hisparse`	flag; default: disabled	Enable HiSparse on the decode instance
`—hisparse-config`	JSON string	Configuration for HiSparse (see below)

HiSparse Config Parameters

Pass as a JSON string via --hisparse-config:

Parameter	Type / Default	Description
`top_k`	int	Number of topk entries
`device_buffer_size`	int	Number of token slots in the per-request GPU device buffer
`host_to_device_ratio`	int	Ratio of logical pool size to device pool size, determining host memory capacity
`swap_in_block_size`	int / 960	CUDA thread-block size for the HiSparse swap-in kernel

Example: --hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10, "swap_in_block_size": 960}'

Deployment

HiSparse currently requires PD disaggregation mode and is enabled only on the decode instance.

Prefill Instance

Command

python3 -m sglang.launch_server \
    --model-path /path/to/model \
    --trust-remote-code \
    --port 8000 --host 0.0.0.0 \
    --context-length 81920 \
    --chunked-prefill-size 65536 \
    --tp-size 8 --dp-size 8 --enable-dp-attention \
    --mem-fraction-static 0.85 \
    --disaggregation-mode prefill \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    --nnodes 1 --node-rank 0

Decode Instance (with HiSparse)

Command

python3 -m sglang.launch_server \
    --model-path /path/to/model \
    --trust-remote-code \
    --port 8000 --host 0.0.0.0 \
    --context-length 81920 \
    --tp-size 8 --dp-size 8 --enable-dp-attention \
    --mem-fraction-static 0.85 \
    --disable-radix-cache \
    --disaggregation-mode decode \
    --disaggregation-ib-device mlx5_0,mlx5_1,mlx5_2,mlx5_3 \
    --dist-init-addr 127.0.0.1:5757 \
    --nnodes 1 --node-rank 0 \
    --enable-hisparse \
    --hisparse-config='{"top_k": 2048, "device_buffer_size": 6144, "host_to_device_ratio": 10, "swap_in_block_size": 960}'

Note: For DSA models, --kv-cache-dtype defaults to auto, which resolves to fp8_e4m3 on SM100+ (Blackwell) and bfloat16 on older architectures. The DSA decode backend is automatically selected based on KV dtype (bfloat16 → flashmla_sparse, fp8_e4m3 → flashmla_kv). DSA backend flags apply only to DSA models; DeepSeek V4 uses its own dsv4 attention backend.

Benchmark

Command

python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-path /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
    --dataset-name random \
    --random-input 40000 \
    --random-output 20000 \
    --num-prompts 200 \
    --max-concurrency 200 \
    --request-rate 40 \
    --random-range-ratio 1.0 \
    --host 127.0.0.1 \
    --port 20000 \
    --model /path/to/model \
    --flush-cache \

Key Notes

The prefill instance does not need --enable-hisparse; it is unaware of HiSparse.
On the decode instance, --enable-hisparse and --hisparse-config are required for HiSparse.
For DSA models, --kv-cache-dtype bfloat16 uses flashmla_sparse, and --kv-cache-dtype fp8_e4m3 uses flashmla_kv.
For DeepSeek V4, DSA backend flags are not applicable. DeepSeek V4 uses the dsv4 attention backend and fp8_e4m3 KV cache by default.
host_to_device_ratio should be configured based on the host machine’s available memory. For example:
- ~1 TB host memory → host_to_device_ratio: 5
- ~2 TB host memory → host_to_device_ratio: 10

Acknowledgments

We would like to thank the SGLang team and community for the implementation and generous support, especially Zhiqiang Xie, Zhangheng Huang, Tingwei Huang, Shangming Cai, Teng Ma, and many others. We also thank the Alibaba Cloud TairKVCache team and the AntGroup SCT Inference team for their valuable contributions.

Basic Usage

Advanced Features

Supported Models

Developer Guide

References

HiSparse: Hierarchical Sparse Attention

Why HiSparse?

Design Overview

Decode Workflow

PD Disaggregation Integration (Direct-to-Host)

Server Arguments

HiSparse Config Parameters

Deployment

Prefill Instance

Decode Instance (with HiSparse)

Benchmark

Key Notes

Acknowledgments

​Why HiSparse?

​Design Overview

​Decode Workflow

​PD Disaggregation Integration (Direct-to-Host)

​Server Arguments

​HiSparse Config Parameters

​Deployment

​Prefill Instance

​Decode Instance (with HiSparse)

​Benchmark

​Key Notes

​Acknowledgments

Why HiSparse?

Design Overview

Decode Workflow

PD Disaggregation Integration (Direct-to-Host)

Server Arguments

HiSparse Config Parameters

Deployment

Prefill Instance

Decode Instance (with HiSparse)

Benchmark

Key Notes

Acknowledgments