# MiMo-V2-Flash

## Introduction

MiMo-V2-Flash is a Mixture-of-Experts (MoE) large language model developed by Xiaomi. It supports speculative decoding
for accelerated inference, and it can be deployed with prefill-decode (PD) disaggregation for high-throughput,
low-latency serving.

This document demonstrates the deployment of MiMo-V2-Flash on Ascend NPUs using SGLang, including multi-node PD
disaggregation mode, feature configuration, and performance optimization.

This document was written and validated against **SGLang v0.5.13**, which fully supports MiMo-V2-Flash. To use the
latest features (e.g., PD disaggregation, speculative decoding), use v0.5.13 or later.

## Supported features

| Feature              | Example usage                                                                                                                                                                        |
| -------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| Tensor Parallelism   | `--tp-size 8` (prefill) or `--tp-size 16` (decode)                                                                                                                                   |
| Data Parallelism     | `--dp-size 2`                                                                                                                                                                        |
| Expert Parallelism   | `--moe-a2a-backend deepep \`<br />`--deepep-mode low_latency`                                                                                                                        |
| PD Disaggregation    | `--disaggregation-mode prefill \`<br />`--disaggregation-transfer-backend ascend`                                                                                                    |
| Quantization         | `--quantization modelslim`                                                                                                                                                           |
| NPU Graph            | enabled by default; disable with `--disable-cuda-graph`;<br />control range via `--cuda-graph-bs` or `--cuda-graph-max-bs`; e.g. `--cuda-graph-bs 1 2 4 8 12 16 20 24 28 32`         |
| Speculative Decoding | `--speculative-algorithm EAGLE \`<br />`--speculative-num-steps 3 \`<br />`--speculative-eagle-topk 1 \`<br />`--speculative-num-draft-tokens 4 \`<br />`--enable-multi-layer-eagle` |
| Overlap Schedule     | `export SGLANG_ENABLE_OVERLAP_PLAN_STREAM=0`                                                                                                                                         |
| DP LM Head           | `--enable-dp-lm-head`                                                                                                                                                                |
| DP Attention         | `--enable-dp-attention`                                                                                                                                                              |

<Note>
  The values in the **Example usage** column are for illustration only. Adjust them according to your hardware, deployment
  mode, and workload. For parameter details, see
  [Feature descriptions](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization#feature-descriptions); for
  recommended configurations for each deployment scenario, see [Best practices](#best-practices).
</Note>

For compatibility and conflicts between features,
see [Feature Compatibility](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization#feature-compatibility).

## Prerequisites

### Environment

Before following this tutorial, complete the environment setup in the documents below:

* [Ascend NPU Quickstart](/docs/hardware-platforms/ascend-npus/ascend_npu_quick_start) — the fastest way to get started.
  It walks you through launching the official container image, starting the SGLang server, and sending a test request.
  Recommended if you are new to SGLang on Ascend.
* [SGLang Installation with NPU Support](/docs/hardware-platforms/ascend-npus/ascend_npu) — the full installation guide.
  It covers the component version mapping (CANN, PyTorch adapter, Triton, kernels, etc.), building from source or from a
  Dockerfile, and recommended system settings (CPU power scheme, NUMA, swap). Use it when you need to install or customize
  the environment instead of using the official image.

### Model weights

<Warning>
  Before downloading model weights, check the model size to reserve enough disk space.
</Warning>

* [MiMo-V2-Flash-W8A8](https://modelers.cn/models/Modelers_Park/MiMo-V2-Flash-W8A8) (Quantized version)

Ensure the available device memory exceeds the model weight size before deployment. For optimal throughput and latency,
refer to the [best practice configurations](#best-practices), which may require additional nodes or cards.

For multi-node deployment, download the model weights to a directory shared across all nodes.

## Installation

<Warning>
  The Docker image requires at least **30 GB** of free space. Ensure sufficient disk space before pulling images.
</Warning>
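
The free-space check can be scripted before pulling. The following is a minimal sketch, not an official tool:
`has_free_space_gb` is a hypothetical helper name, and `/var/lib/docker` is assumed to be the Docker data root (the
default on Linux). Set `DOCKER_ROOT` if your installation stores images elsewhere.

```shell
# has_free_space_gb is a hypothetical helper, not part of Docker or SGLang.
# Usage: has_free_space_gb <directory> <required_gb>
has_free_space_gb() {
    dir="$1"
    required_gb="$2"
    # df -Pk prints POSIX-format output in 1024-byte blocks; column 4 is the available space.
    avail_kb="$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')"
    [ "$avail_kb" -ge $((required_gb * 1024 * 1024)) ]
}

# Assumption: images are stored under Docker's default data root.
DOCKER_ROOT="${DOCKER_ROOT:-/var/lib/docker}"
if has_free_space_gb "$DOCKER_ROOT" 30; then
    echo "enough free space for the image"
else
    echo "less than 30 GB free (or path not found) under ${DOCKER_ROOT}" >&2
fi
```

The same helper can be pointed at the model weight directory, with the required size taken from the model page.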

The dependencies required for the NPU runtime environment are packaged in a Docker image published on quay.io, which
you can pull directly.

Both **stable releases** and **daily builds** are available. The following command is based on the stable release tag.
For details, see [Docker image versions](/docs/hardware-platforms/ascend-npus/ascend_npu_faq#8-docker-image-versions-stable-release-vs-daily-build).

<Tabs>
  <Tab title="Atlas 800I A3">
    ```bash Command theme={null}
    docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3

    docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0  \
    --device=/dev/davinci1:/dev/davinci1  \
    --device=/dev/davinci2:/dev/davinci2  \
    --device=/dev/davinci3:/dev/davinci3  \
    --device=/dev/davinci4:/dev/davinci4  \
    --device=/dev/davinci5:/dev/davinci5  \
    --device=/dev/davinci6:/dev/davinci6  \
    --device=/dev/davinci7:/dev/davinci7  \
    --device=/dev/davinci8:/dev/davinci8  \
    --device=/dev/davinci9:/dev/davinci9  \
    --device=/dev/davinci10:/dev/davinci10  \
    --device=/dev/davinci11:/dev/davinci11  \
    --device=/dev/davinci12:/dev/davinci12  \
    --device=/dev/davinci13:/dev/davinci13  \
    --device=/dev/davinci14:/dev/davinci14  \
    --device=/dev/davinci15:/dev/davinci15  \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-a3
    ```
  </Tab>

  <Tab title="Atlas 800I A2">
    ```bash Command theme={null}
    docker pull quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b

    docker run -itd --shm-size=16g --name ${NAME} \
    --privileged=true --net=host \
    -v /var/queue_schedule:/var/queue_schedule \
    -v /etc/ascend_install.info:/etc/ascend_install.info \
    -v /usr/local/sbin:/usr/local/sbin \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --device=/dev/davinci0:/dev/davinci0  \
    --device=/dev/davinci1:/dev/davinci1  \
    --device=/dev/davinci2:/dev/davinci2  \
    --device=/dev/davinci3:/dev/davinci3  \
    --device=/dev/davinci4:/dev/davinci4  \
    --device=/dev/davinci5:/dev/davinci5  \
    --device=/dev/davinci6:/dev/davinci6  \
    --device=/dev/davinci7:/dev/davinci7  \
    --device=/dev/davinci_manager:/dev/davinci_manager \
    --device=/dev/hisi_hdc:/dev/hisi_hdc \
    --entrypoint=bash \
    quay.io/ascend/sglang:v0.5.13.post1-cann9.0.0-910b
    ```
  </Tab>
</Tabs>

<Tip>
  * If the model weights have already been downloaded to a shared directory, use `-v` to mount the model path into the
    container, for example: `-v /path/to/models:/models`.
  * Replace `${NAME}` with your own container name, or remove `--name` to let Docker assign a default name.
</Tip>
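
The repeated `--device=/dev/davinciN` lines in the commands above can also be generated in a loop. This is a minimal
sketch under stated assumptions: `npu_device_flags` is a hypothetical helper name, and the device count (16 for the
Atlas 800I A3 command, 8 for the Atlas 800I A2 command) is taken from the examples above and should match the NPUs
actually present on your host.

```shell
# npu_device_flags is a hypothetical helper, not part of Docker or SGLang.
# It prints one --device flag per NPU, for devices 0..(count-1).
npu_device_flags() {
    count="$1"
    flags=""
    i=0
    while [ "$i" -lt "$count" ]; do
        flags="${flags} --device=/dev/davinci${i}:/dev/davinci${i}"
        i=$((i + 1))
    done
    # Strip the leading space added by the first iteration.
    printf '%s\n' "${flags# }"
}

# Example: 8 devices, as in the Atlas 800I A2 command.
DEVICE_FLAGS="$(npu_device_flags 8)"
echo "${DEVICE_FLAGS}"
```

To use the result, pass `${DEVICE_FLAGS}` unquoted to `docker run` in place of the individual `--device=/dev/davinciN`
lines, so that the shell splits it into separate arguments. The `davinci_manager` and `hisi_hdc` devices still need
their own `--device` flags.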

## Online service deployment

### Multi-node PD disaggregation deployment

PD disaggregation splits the prefill and decode stages onto separate nodes, reducing interference between the two stages
and improving throughput in high-concurrency scenarios. This scenario is covered on the best practice page. For the
complete, optimized deployment commands and benchmark data, see
[MiMo-V2-Flash Best Practice — W8A8 24P PD Disaggregation On A3](/docs/hardware-platforms/ascend-npus/best_practice/mimo_v2_flash#pd-disaggregation).

## Functional verification

After the service is started, you can invoke the model by sending a prompt:

```shell theme={null}
# ============================================================
# Before running, update the following variables:
#   HOST: the server host address (e.g., localhost)
#   PORT: the server port number (e.g., 9903)
# ============================================================

curl http://${HOST}:${PORT}/generate \
    -H "Content-Type: application/json" \
    -d '{
        "text": "What is the capital of France?",
        "sampling_params": {
            "max_new_tokens": 64,
            "temperature": 0
        }
    }'
```

Expected result: an HTTP 200 response with the generated text containing "Paris".

Once the server prints `The server is fired up and ready to roll!` in the logs, it is ready to accept requests. For more
testing examples (Health Check, Generate, Chat Completions, and port usage guidance),
see [Testing the Service](/docs/hardware-platforms/ascend-npus/ascend_npu#testing-the-service).
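
Instead of watching the logs, a script can poll the server until it responds. The sketch below assumes the server
exposes a `GET /health` endpoint on the same host and port as `/generate`; `wait_until_healthy` is a hypothetical
helper name, and the retry count and interval are arbitrary defaults to adjust for your model load time.

```shell
# wait_until_healthy is a hypothetical helper; it assumes the server answers GET /health.
# Usage: wait_until_healthy <host> <port> [max_attempts] [interval_seconds]
wait_until_healthy() {
    host="$1"
    port="$2"
    max_attempts="${3:-60}"
    interval="${4:-5}"
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        # -f makes curl exit non-zero on HTTP errors; -s silences progress output.
        if curl -sf -o /dev/null --max-time 5 "http://${host}:${port}/health"; then
            echo "server is ready after ${attempt} attempt(s)"
            return 0
        fi
        attempt=$((attempt + 1))
        # Sleep only if another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$interval"
        fi
    done
    echo "server not ready after ${max_attempts} attempt(s)" >&2
    return 1
}
```

For example, `wait_until_healthy "${HOST}" "${PORT}"` can be run before the `curl` request above so that the request
is sent only once the server is accepting connections.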

## Accuracy evaluation

For accuracy evaluation methods and datasets, see [Accuracy Evaluation on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_accuracy_evaluation).

## Performance

For performance data and benchmark commands, see [Performance Testing on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_performance_testing).

## Best practices

### Best practice configuration reference

For complete optimal configurations with deployment scripts and benchmark commands, see the
[MiMo-V2-Flash Best Practice](/docs/hardware-platforms/ascend-npus/best_practice/mimo_v2_flash) page.

## Performance tuning

For the full list of supported features, see [Supported features](#supported-features). For detailed optimization
guidance, see [Optimization on Ascend NPU](/docs/hardware-platforms/ascend-npus/ascend_npu_optimization).

## FAQ

For common environment, installation, and general parameter issues, please refer to the [Ascend NPU FAQ](/docs/hardware-platforms/ascend-npus/ascend_npu_faq).
This section only covers model-specific issues.
