基于vLLM-Ascend的DeepSeek-V3.2模型Atlas 800I A3单机混部部署实践 ​作者​昇腾实战派​知识地图​https://blog.csdn.net/Lumos_Lovegood/article/details/161601003背景概述本文档将介绍基于vLLM-Ascend的DeepSeek-V3.2模型在Atlas 800I A3上的单机混部部署实践包括支持的特性、特性配置、环境信息以及性能测试典型case基本信息软件版本设备信息组网形态总卡数数据格式0.18.0NPU: Atlas 800I A3-560T, HBM 128GCPU: Kunpeng 920 (80核-2900MHz)内存: 32根64G5200MHzOS: OpenEuler 22.03 LTS-SP4Atlas 800I A3单机8W8A8服务化配置低时延/高吞吐exportOMP_PROC_BINDfalseexportOMP_NUM_THREADS10exportHCCL_OP_EXPANSION_MODEAIVexportPYTORCH_NPU_ALLOC_CONFexpandable_segments:TrueexportLD_PRELOAD/usr/lib/aarch64-linux-gnu/libjemalloc.so.2:$LD_PRELOADexportVLLM_USE_V11exportHCCL_BUFFSIZE256exportASCEND_AGGREGATE_ENABLE1exportASCEND_TRANSPORT_PRINT1exportACL_OP_INIT_MODE1exportASCEND_A3_ENABLE1exportVLLM_NIXL_ABORT_REQUEST_TIMEOUT300000exportTASK_QUEUE_ENABLE1exportVLLM_ASCEND_ENABLE_MLAPO1exportVLLM_ASCEND_ENABLE_FLASHCOMM11vllm serve /mnt/share/weights/DeepSeek-V3.2-W8A8\--port8003\--data-parallel-size2\--tensor-parallel-size8\--seed1024\--served-model-name dsv3\--max-model-len67000\--max-num-batched-tokens4096\--max-num-seqs8\--trust-remote-code\--quantizationascend\--async-scheduling\--no-enable-prefix-caching\--enable-expert-parallel\--gpu-memory-utilization0.95\--compilation-config{cudagraph_mode:FULL_DECODE_ONLY, cudagraph_capture_sizes:[1,2,4,8,16,24,32,40,48]}\--speculative-config{num_speculative_tokens: 3, method:deepseek_mtp}\--tokenizer-mode deepseek_v32\--reasoning-parser deepseek_v3典型测试用例平均输入平均输出并行策略上下文长度Prefix Cache命中率总请求数最大并发数请求频率(req/s)163841024MLADP2TP8670000410163841024MLADP2TP86700001640.532768512MLADP2TP867000041032768512MLADP2TP8670000820.2655361024MLADP2TP8670000410655361024MLADP2TP867000082120482048MLADP2TP88000041020482048MLADP2TP880000164035001500MLADP2TP88000041035001500MLADP2TP8800001640测试命令参考aisbench官方测试指南。aisbench测试命令vllm-ascend社区官网特别声明以上配置均未开启Prefix Cache若实际生产环境需要使用该特性参考vLLM-Ascend社区参数指南开启–enable-prefix-caching