# Hand-written operator optimization: Ascend-super is about 25% faster than the official vLLM-Ascend engine on Huawei Ascend 910 (Ascend A3)

I recently pushed the single-batch decode performance of DeepSeek-R1-Distill-Qwen-7B on Ascend A3 to an interesting place, inside an LLM inference engine that I implemented from scratch. On the same Ascend A3 machine, with the same model, the same prompt, and the same 128 generated tokens:

- the Ascend-super direct `.so` path reached about 47.1 tok/s;
- that is about 25.1% faster than the vLLM-Ascend baseline at 37.639 tok/s;
- and about 36.0% faster than the torch_npu baseline at 34.627 tok/s.

This result does not come from calling PyTorch, Transformers, or vLLM. It comes from the project's own C / AscendCL / ACLNN direct inference path.

## Project links

- GitHub: https://github.com/luogantt/LLM-inference-engine
- Matching tag: https://github.com/luogantt/LLM-inference-engine/tree/ascend-super-vs-vllm-47tok

Fetch the code:

```bash
git clone https://github.com/luogantt/LLM-inference-engine.git
cd LLM-inference-engine
git checkout ascend-super-vs-vllm-47tok
```

## Test environment

- Hardware: Ascend A3
- Model: DeepSeek-R1-Distill-Qwen-7B
- batch: 1
- prompt: 黑格尔的哲学思想可以概括为 (roughly, "Hegel's philosophical thought can be summarized as")
- max_new_tokens: 128
- max_seq: 800

Note: the PyTorch baseline here runs on Ascend/NPU through torch_npu. It is the control group closest to an ordinary torch inference experience. The main conclusion of this post compares three paths on the same Ascend A3: torch_npu, vLLM-Ascend, and this project's direct `.so` path.

## Results summary

| Path | Speed | Relative to Ascend-super |
| --- | --- | --- |
| torch_npu baseline | 34.627 tok/s | Ascend-super is about 36.0% faster |
| vLLM-Ascend baseline | 37.639 tok/s | Ascend-super is about 25.1% faster |
| Ascend-super direct `.so` | about 47.1 tok/s | 1.00x |

How the percentages are computed:

```text
Ascend-super vs torch_npu:   (47.1 / 34.627 - 1) * 100% ≈ 36.0%
Ascend-super vs vLLM-Ascend: (47.1 / 37.639 - 1) * 100% ≈ 25.1%
```

## torch_npu baseline

Test command:

```bash
cd ~/LLM-inference-engine
export ASCEND_VISIBLE_DEVICES=4
python python_infer_ascend.py \
  --model ./deepseek-r1-7b \
  --prompt "黑格尔的哲学思想可以概括为" \
  --max-new-tokens 128 \
  --max-seq 800 \
  --device npu:0 \
  --dtype float16 \
  2>&1 | tee torch_npu_128.log
```

Key log:

```text
performance
generated_tokens=128
elapsed_s=3.696
tokens_per_s=34.627
```

## vLLM-Ascend baseline

On A3, vLLM-Ascend needs an install package that matches A3, or a build from source. An ordinary A2 wheel fails with an error like this:

```text
Current device type: AscendDeviceType.A3 does not match the installed version's device type: AscendDeviceType.A2
```

This baseline was run successfully on an A3 build of vLLM-Ascend.

Test script:

```bash
cd ~/LLM-inference-engine
cat > vllm_ascend_offline_test.py <<'PY'
import time
from vllm import LLM, SamplingParams

MODEL = "./deepseek-r1-7b"
PROMPT = "黑格尔的哲学思想可以概括为"

# Greedy decoding, capped at 128 new tokens.
sampling = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(
    model=MODEL,
    tokenizer=MODEL,
    trust_remote_code=True,
    dtype="float16",
    max_model_len=800,
    max_num_seqs=1,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
)

# Untimed warm-up run, so the timed run below excludes one-off setup cost.
llm.generate([PROMPT], sampling)

t0 = time.perf_counter()
outputs = llm.generate([PROMPT], sampling)
t1 = time.perf_counter()

out = outputs[0].outputs[0]
new_tokens = len(out.token_ids)
elapsed = t1 - t0

print("generated text")
print(out.text)
print()
print("performance")
print(f"generated_tokens={new_tokens}")
print(f"elapsed_s={elapsed:.6f}")
print(f"tokens_per_s={new_tokens / elapsed:.3f}")
PY
```

Run command:

```bash
cd ~/LLM-inference-engine
source ~/venvs/vllm-ascend/bin/activate
mkdir -p ~/ascend/log
unset ASCEND_RT_VISIBLE_DEVICES
export ASCEND_VISIBLE_DEVICES=4
export PYTORCH_NPU_ALLOC_CONF=max_split_size_mb:256
python vllm_ascend_offline_test.py 2>&1 | tee vllm_ascend_offline_128.log
```

Key log:

```text
Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.40s/it, est. speed input: 2.35 toks/s, output: 37.65 toks/s]
performance
generated_tokens=128
elapsed_s=3.400694
tokens_per_s=37.639
```

During shutdown after generation, the following message may appear:

```text
Engine core proc EngineCore died unexpectedly, shutting down client.
```

It is printed after `generated_tokens=128` and `tokens_per_s=37.639`, so it does not affect the performance numbers of this run.

## Ascend-super direct .so

The Ascend-super path in this project uses neither the PyTorch graph nor the vLLM engine. A Python tokenizer front end calls a C shared library:

```text
python_infer.py -> build/libllm_ascend.so -> AscendCL / ACLNN
```

Build:

```bash
cd ~/LLM-inference-engine
make -f Makefile.cuda_lib clean-lib
make -f Makefile.cuda_lib lib-ascend ASCEND_HOME=/usr/local/Ascend/cann-8.5.1
```

Inference command:

```bash
cd ~/LLM-inference-engine
mkdir -p ~/ascend/log
export ASCEND_VISIBLE_DEVICES=4
export ASCEND_DEVICE_ID=0
export ASCEND_LOAD_WEIGHTS=all
export ASCEND_WEIGHT_LOAD_LOG=0
export ASCEND_TIME_LOG_FILE=0
export ASCEND_HOST_RAW_CACHE=0
export ASCEND_RUN_EMBED=1
export ASCEND_DIRECT_DECODE=all_layers_ref
export ASCEND_REF_CACHE_WEIGHTS=1
export ASCEND_REF_CACHE_LOG=0
export ASCEND_REF_KV_CACHE=1
export ASCEND_REF_U16_WEIGHTS=1
export ASCEND_REF_FAST_DOT=1
export ASCEND_REF_DOT4=0
export ASCEND_REF_NEON_DOT=1
export ASCEND_ATTN_BACKEND=cpu
export ASCEND_QKV_BACKEND=aclnn
export ASCEND_QKV_FUSE_WEIGHTS=1
export ASCEND_QKV_FALLBACK=0
export ASCEND_QKV_LOG=0
export ASCEND_MLP_BACKEND=aclnn
export ASCEND_MLP_FUSE_GATE_UP=1
export ASCEND_MLP_FALLBACK=0
export ASCEND_MLP_LOG=0
export ASCEND_ATTN_PROJ_BACKEND=aclnn
export ASCEND_ATTN_PROJ_FALLBACK=0
export ASCEND_ATTN_PROJ_LOG=0
export ASCEND_LM_HEAD_BACKEND=aclnn
export ASCEND_LM_HEAD_FALLBACK=0
export ASCEND_LM_HEAD_LOG=0
export ASCEND_ACLNN_CUBE_MATH_TYPE=0
export ASCEND_REF_LINEAR_THREADS=16
export ASCEND_REF_ATTN_LINEAR_THREADS=16
export ASCEND_REF_ATTN_THREADS=16
export ASCEND_REF_ATTN_THREAD_MIN_SEQ=32
export ASCEND_REF_MLP_THREADS=24
export ASCEND_REF_DOWN_THREADS=24
export ASCEND_LM_HEAD_THREADS=16
export ASCEND_REF_PROFILE_LAYERS=0
export ASCEND_REF_PROFILE_TOKEN_LIMIT=0
python python_infer.py \
  --model ./deepseek-r1-7b \
  --lib ./build/libllm_ascend.so \
  --prompt "黑格尔的哲学思想可以概括为" \
  --max-new-tokens 128 \
  --max-seq 800 \
  --tokenizer-backend tokenizers \
  --no-chat-template \
  2>&1 | tee ascend_super_128.log
```

Key log:

```text
[Ascend][time] decode all_layers_ref finished, token=102989, pos=117, elapsed_ms=21.191644
[Ascend][time] decode all_layers_ref finished, token=109732, pos=118, elapsed_ms=21.171014
[Ascend][time] decode all_layers_ref finished, token=54926, pos=119, elapsed_ms=21.191515
[Ascend][time] decode all_layers_ref finished, token=100116, pos=120, elapsed_ms=21.187475
[Ascend][time] decode all_layers_ref finished, token=9370, pos=121, elapsed_ms=21.103363
[Ascend][time] decode all_layers_ref finished, token=104380, pos=122, elapsed_ms=21.098643
[Ascend][time] decode all_layers_ref finished, token=104734, pos=123, elapsed_ms=20.984602
[Ascend][time] decode all_layers_ref finished, token=101036, pos=124, elapsed_ms=21.112034
[Ascend][time] decode all_layers_ref finished, token=26850, pos=125, elapsed_ms=21.102433
[Ascend][time] decode all_layers_ref finished, token=101140, pos=126, elapsed_ms=21.126373
[Ascend][time] decode all_layers_ref finished, token=3837, pos=127, elapsed_ms=21.154024
[Ascend][time] decode all_layers_ref finished, token=99720, pos=128, elapsed_ms=21.195795
[Ascend][time] decode all_layers_ref finished, token=85106, pos=129, elapsed_ms=21.228775
[Ascend][time] decode all_layers_ref finished, token=100692, pos=130, elapsed_ms=21.235385
[Ascend][time] decode all_layers_ref finished, token=104734, pos=131, elapsed_ms=21.313266
[Ascend][time] decode all_layers_ref finished, token=109151, pos=132, elapsed_ms=21.418098
[Ascend][time] decode all_layers_ref finished, token=33108, pos=133, elapsed_ms=21.233564
[Ascend][time] decode all_layers_ref finished, token=100466, pos=134, elapsed_ms=21.346556
[Ascend][time] decode all_layers_ref finished, token=1773, pos=135, elapsed_ms=21.434498
[Ascend][time] decode all_layers_ref finished, token=100220, pos=136, elapsed_ms=21.227135
```

A rough conversion from the last token's decode time:

```text
1000 / 21.227135 ≈ 47.11 tok/s
```

## Why is it faster?

The gain does not come from simply swapping frameworks. It comes from removing work along the decode hot path:

- Weights stay resident on the device, which cuts repeated loading and format conversion.
- QKV uses a fused weight: one ACLNN matmul produces Q, K, and V, so fewer matmuls are issued.
- MLP gate/up are fused, which lowers the small-batch scheduling overhead during decode.
- lm_head uses an ACLNN argmax path, so the full logits never have to be turned into Python / torch tensors.
- The KV cache, the U16 weight cache, and host buffers are reused, which lowers the per-token cost of memory allocation and copies.
- For single-batch decode, the generic framework scheduling layer is bypassed, and the attention and residual paths are pushed down to low-overhead implementations wherever possible.

vLLM-Ascend is an excellent general-purpose inference framework. Its strengths are serving, scheduling, high concurrency, PagedAttention, and ecosystem integration. The Ascend-super path is closer to a limit experiment for single-batch decode: it trades generality for direct, short code paths and operator scheduling that I control.

## How to reproduce

### Model download

If the model is not available locally, use the download script in the repository:

```bash
cd ~/LLM-inference-engine
pip install -U modelscope
python download_model.py \
  --source modelscope \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir ./deepseek-r1-7b
```

Or use HuggingFace:

```bash
pip install -U huggingface_hub
python download_model.py \
  --source huggingface \
  --model deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
  --local-dir ./deepseek-r1-7b
```

## Conclusion

In this set of measurements, the Ascend-super direct `.so` path beat both the regular torch_npu baseline and the vLLM-Ascend baseline:

```text
torch_npu baseline:       34.627 tok/s
vLLM-Ascend baseline:     37.639 tok/s
Ascend-super direct .so:  about 47.1 tok/s
```

In other words, for this specific scenario (DeepSeek-R1-Distill-Qwen-7B, Ascend A3, single batch, 128-token decode), an AscendCL / ACLNN inference path written by hand from scratch can already run about 25% faster than vLLM-Ascend.

The next goal is straightforward: keep shrinking the decode hot path and push 47 tok/s to 50 tok/s.
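The throughput and speedup figures quoted in this post can be rechecked with a few lines of Python. The sketch below is only an illustration: it assumes the per-token lines in `ascend_super_128.log` carry an `elapsed_ms` field as shown in the key log, and the helper names (`decode_times_ms`, `tokens_per_second`, `speedup_percent`) are hypothetical, not part of the repository. Only three log lines are embedded to keep it short.

```python
import re

# Three lines copied from the Ascend-super key log in this post.
LOG = """\
[Ascend][time] decode all_layers_ref finished, token=100466, pos=134, elapsed_ms=21.346556
[Ascend][time] decode all_layers_ref finished, token=1773, pos=135, elapsed_ms=21.434498
[Ascend][time] decode all_layers_ref finished, token=100220, pos=136, elapsed_ms=21.227135
"""

# The "=" is optional so the pattern also tolerates logs that lost it in copy/paste.
ELAPSED_RE = re.compile(r"elapsed_ms=?(\d+(?:\.\d+)?)")


def decode_times_ms(log_text):
    """Return every per-token decode time (in milliseconds) found in the text."""
    return [float(m.group(1)) for m in ELAPSED_RE.finditer(log_text)]


def tokens_per_second(times_ms):
    """Average decode throughput over the given per-token latencies."""
    return 1000.0 * len(times_ms) / sum(times_ms)


def speedup_percent(fast_tok_s, slow_tok_s):
    """How much faster `fast` is than `slow`, as a percentage."""
    return (fast_tok_s / slow_tok_s - 1.0) * 100.0


times = decode_times_ms(LOG)
print(f"last token only:   {1000.0 / times[-1]:.2f} tok/s")
print(f"mean of {len(times)} tokens:  {tokens_per_second(times):.2f} tok/s")
print(f"vs vLLM-Ascend:    {speedup_percent(47.1, 37.639):.1f}%")
print(f"vs torch_npu:      {speedup_percent(47.1, 34.627):.1f}%")
```

Feeding the whole log file into `decode_times_ms` instead of the three sample lines gives an average over all decode steps, which is a steadier number than the single last-token conversion.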
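The fused-QKV idea mentioned among the optimizations can be pictured with a toy example. This is a minimal pure-Python sketch under stated assumptions, not the project's ACLNN code: the sizes are made up, biases are left out, and `matvec` / `random_matrix` are hypothetical helpers. It only shows that stacking the three projection weights once lets a single matrix product replace three, with identical results.

```python
import random


def matvec(weight, x):
    """Multiply a row-major matrix (a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of uniform random values."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen only for illustration; they are not the real model dimensions.
HIDDEN, Q_DIM, KV_DIM = 8, 8, 2

rng = random.Random(0)
w_q = random_matrix(Q_DIM, HIDDEN, rng)
w_k = random_matrix(KV_DIM, HIDDEN, rng)
w_v = random_matrix(KV_DIM, HIDDEN, rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]

# Unfused: three separate projections, i.e. three matmul calls per layer.
q_ref, k_ref, v_ref = matvec(w_q, x), matvec(w_k, x), matvec(w_v, x)

# Fused: stack the weights row-wise once (at load time), run one matmul per
# token, then slice the output back into Q, K and V.
w_qkv = w_q + w_k + w_v
qkv = matvec(w_qkv, x)
q = qkv[:Q_DIM]
k = qkv[Q_DIM:Q_DIM + KV_DIM]
v = qkv[Q_DIM + KV_DIM:]

print("fused output matches separate projections:",
      q == q_ref and k == k_ref and v == v_ref)
```

On an accelerator the arithmetic is the same either way; the saving comes from issuing one kernel launch instead of three for each layer and token, which matters most at batch size 1.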