Prerequisites
Building the native Metal kernels in sgl-kernel requires the Apple
toolchain (clang++, the Metal framework headers, and xcrun). These ship
with the Xcode Command Line Tools, which cannot be installed via pip.
To check that they are present, run:
xcode-select -p && xcrun --find metal
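The same check can be scripted, for example as a pre-build step. The following is a minimal sketch, not part of SGLang: the helper names `find_tool` and `check_metal_toolchain` are hypothetical, and the script only reports what it finds on `PATH` rather than installing anything.

```python
import shutil
import subprocess


def find_tool(name):
    """Return the path of an executable on PATH, or None if it is missing."""
    return shutil.which(name)


def check_metal_toolchain():
    """Report which of the tools named in the prerequisites are visible."""
    status = {tool: find_tool(tool) for tool in ("clang++", "xcrun", "xcode-select")}
    # `xcrun --find metal` can only be asked when xcrun itself exists.
    if status["xcrun"]:
        result = subprocess.run(
            ["xcrun", "--find", "metal"], capture_output=True, text=True
        )
        status["metal"] = result.stdout.strip() if result.returncode == 0 else None
    else:
        status["metal"] = None
    return status


if __name__ == "__main__":
    for tool, path in check_metal_toolchain().items():
        print(f"{tool}: {path or 'NOT FOUND'}")
```

Any entry reported as NOT FOUND points to a missing or incomplete Xcode Command Line Tools installation.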
Install SGLang
You can install SGLang using one of the methods below.
Install from Source
Launch of the Serving Engine
Launch the server with the following environment variables and flags:
- SGLANG_USE_MLX=1: Enables MLX as the SGLang runtime backend. If it is not set, SGLang falls back to torch.mps, which has less support.
- --disable-cuda-graph: Disables CUDA graphs, which are not relevant on Apple Metal.
- --disable-overlap-schedule: Disables overlap scheduling, which is implemented with MLX's async_eval(). Overlap scheduling is enabled by default, that is, whenever this flag is absent.
- SGLANG_MLX_USE_CUSTOM_ROPE=1: Enables the optional custom Metal RoPE kernel. It is disabled by default, so the MLX backend uses the standard RoPE path unless you opt in for A/B testing.
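As an illustration of how these settings combine, the sketch below assembles a launch command from Python without executing it. It assumes the usual `python -m sglang.launch_server --model-path ...` entry point; the helper `build_launch` and the model name `your-org/your-model` are placeholders for illustration, not names taken from this guide.

```python
import shlex
import sys


def build_launch(model_path, use_mlx=True, custom_rope=False, disable_overlap=False):
    """Return (env_overrides, argv) for a server launch; nothing is executed."""
    env_overrides = {}
    if use_mlx:
        env_overrides["SGLANG_USE_MLX"] = "1"
    if custom_rope:
        # Opt-in only: the custom Metal RoPE kernel is off by default.
        env_overrides["SGLANG_MLX_USE_CUSTOM_ROPE"] = "1"
    argv = [
        sys.executable, "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", model_path,
        "--disable-cuda-graph",  # CUDA graphs do not apply on Apple Metal
    ]
    if disable_overlap:
        argv.append("--disable-overlap-schedule")
    return env_overrides, argv


if __name__ == "__main__":
    env, argv = build_launch("your-org/your-model", custom_rope=True)
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(argv))
```

To actually start the server, the pair can be passed to `subprocess.run(argv, env={**os.environ, **env})`. Keeping the environment overrides separate from the argument list makes it easy to flip one setting at a time when comparing the custom and standard RoPE paths.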
Quantization
The MLX backend supports two quantization paths on Apple Silicon:
- Pre-quantized HF repos. Any mlx-community/<model>-4bit (or -8bit) repo loads directly through mlx_lm.load(...); no extra flag is needed.
- On-the-fly quantization. For any fp16 model, pass --quantization mlx_q4 or --quantization mlx_q8 to have SGLang quantize the weights at load time via mlx_lm.utils.quantize_model (group size 64, the mlx-community default). The quantized weights stay in process memory; the on-disk model is untouched.
Expected log line:

The MLX backend silently ignores --quantization mlx_q4 when the model is already quantized in its HF config (path 1), so the same flag is safe to pass either way.
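To get a feel for what group size 64 means for memory, the sketch below estimates the storage cost per weight. It assumes, as a simplification not stated in this guide, that each group of 64 weights carries one 16-bit scale and one 16-bit bias alongside the packed values; the function names are hypothetical, and real models may keep some tensors unquantized, so treat the result as a rough estimate.

```python
def effective_bits_per_weight(bits, group_size=64, scale_bits=16, bias_bits=16):
    """Packed bits plus per-group scale/bias overhead, amortized per weight."""
    return bits + (scale_bits + bias_bits) / group_size


def estimated_weight_bytes(num_params, bits, group_size=64):
    """Rough size in bytes of num_params weights quantized to `bits`."""
    return num_params * effective_bits_per_weight(bits, group_size) / 8


if __name__ == "__main__":
    print(effective_bits_per_weight(4))  # 4.5 bits per weight for mlx_q4
    print(effective_bits_per_weight(8))  # 8.5 bits per weight for mlx_q8
    # Hypothetical 7-billion-parameter model, sizes in GB (10**9 bytes).
    for bits in (4, 8, 16):
        size = num_bytes = (
            estimated_weight_bytes(7_000_000_000, bits)
            if bits < 16
            else 7_000_000_000 * 2
        )
        print(f"{bits}-bit: {size / 1e9:.2f} GB")
```

Under these assumptions the per-group metadata adds half a bit per weight, so 4-bit quantization takes a bit more than a quarter of the fp16 footprint rather than exactly a quarter.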
Benchmarking with Requests
sglang.benchmark_one_batch calls the synchronous prefill/decode methods directly. It bypasses the scheduler, so it never exercises the overlap code path.
sglang.benchmark_offline_throughput runs through the scheduler and therefore does exercise the overlap code path. Pass --disable-overlap-schedule to turn overlap scheduling off for a run.
