Introduction
The GLM (General Language Model) series is an open-source bilingual large language model family jointly developed by the KEG Laboratory of Tsinghua University and Zhipu AI. With its unified pre-training framework and bilingual capabilities, the series has performed outstandingly in Chinese NLP. GLM-5 adopts the DeepSeek-V3/V3.2 architecture, including DeepSeek Sparse Attention (DSA) and multi-token prediction (MTP). Ascend provides Day-0 support for GLM-5 based on the SGLang inference framework, enabling it with minimal code changes while remaining compatible with the mainstream distributed parallelism capabilities in the current SGLang framework. We welcome developers to download and experience it.
Environment Preparation
Model Weight
- GLM-5.0 (BF16 version): Download model weight.
- GLM-5.0-w4a8 (quantized version without MTP): Download model weight.
- Alternatively, you can use msmodelslim to quantize the model yourself.
Installation
The dependencies required for the NPU runtime environment have been integrated into a Docker image and uploaded to the online platform. You can pull it directly.
Command
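A minimal sketch of pulling the image and starting a container; the image path below is a placeholder (the actual path depends on the platform release), and the device mounts follow the usual Ascend Docker conventions:

```shell
# Pull the prebuilt NPU image.
# NOTE: the image path below is illustrative -- replace it with the
# actual path published on the online platform.
docker pull swr.example.com/ascend/sglang:glm5-latest

# Start a container with access to the NPU devices and driver.
docker run -it --rm \
    --device=/dev/davinci_manager \
    --device=/dev/devmm_svm \
    --device=/dev/hisi_hdc \
    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    -v /path/to/weights:/weights \
    --shm-size=500g \
    swr.example.com/ascend/sglang:glm5-latest bash
```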
Best Practices
Note: When using this image for the best practices below, you need to update transformers to version 5.3.0.
Deployment
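For example, inside the running container:

```shell
# Upgrade transformers to the version required by GLM-5.
pip install transformers==5.3.0
```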
Single-node Deployment
- Quantized model: glm5_w4a8 can be deployed on 1 Atlas 800 A3 (64G × 16).
Launch Server
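A sketch of a single-node launch command, assuming the quantized weights were downloaded to /weights/GLM-5.0-w4a8 (the path and port are placeholders; exact flags may vary with the SGLang version):

```shell
# Launch the SGLang server with tensor parallelism across 16 NPUs.
python -m sglang.launch_server \
    --model-path /weights/GLM-5.0-w4a8 \
    --tp-size 16 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

Once the server is up, it exposes an OpenAI-compatible endpoint at http://<host>:30000/v1/chat/completions.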
Multi-node Deployment
GLM-5-bf16: requires at least 2 Atlas 800 A3 (64G × 16).
Launch Multi-node Server
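A sketch of a two-node launch using SGLang's multi-node flags, assuming the BF16 weights are at /weights/GLM-5.0 on both nodes; MASTER_IP, the port, and the weight path are placeholders:

```shell
# On node 0 (replace MASTER_IP with the IP address of node 0):
python -m sglang.launch_server \
    --model-path /weights/GLM-5.0 \
    --tp-size 32 \
    --dist-init-addr MASTER_IP:5000 \
    --nnodes 2 \
    --node-rank 0 \
    --trust-remote-code \
    --port 30000

# On node 1, keep the same --dist-init-addr and change only the rank:
python -m sglang.launch_server \
    --model-path /weights/GLM-5.0 \
    --tp-size 32 \
    --dist-init-addr MASTER_IP:5000 \
    --nnodes 2 \
    --node-rank 1 \
    --trust-remote-code \
    --port 30000
```

Both nodes must be started with identical model and parallelism settings; only --node-rank differs.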
