CANN/asc-devkit Matmul Basic API

Published: 2026/6/20 11:04:39

# Matmul Basic API Computation Based on Static Tensor Programming

[Free download link] asc-devkit: this project is the operator development language that CANN provides for Ascend AI processors. It natively supports the C and C++ standards, consists mainly of class libraries and a language extension layer, and offers multi-level APIs for operator development across a range of scenarios. Project URL: https://gitcode.com/cann/asc-devkit

## Overview

This sample implements multi-core matrix multiplication based on the static Tensor programming paradigm.

## Supported Products and CANN Software Versions

| Product | CANN Software Version |
| --- | --- |
| Ascend 950PR/Ascend 950DT | CANN 9.1.0 |
| Atlas A3 Training Series Products/Atlas A3 Inference Series Products | CANN 9.0.0 |
| Atlas A2 Training Series Products/Atlas A2 Inference Series Products | CANN 9.0.0 |

## Directory Structure

    ├── matmul_basic_api
    │   ├── scripts
    │   │   ├── gen_data.py         // Script for generating input data and golden data
    │   │   └── verify_result.py    // Script for comparing results against the golden data
    │   ├── CMakeLists.txt          // Build project file
    │   ├── data_utils.h            // Data read and write functions
    │   ├── matmul_basic_api.asc    // Ascend C sample implementation and invocation
    │   └── README.md               // Sample documentation

## Sample Description

**Sample Functionality:**

This sample uses the Ascend C basic API to implement a basic matrix multiplication (Matmul) kernel function. The matrix multiplication formula is as follows:

$$
C = A * B
$$

Here matrix A has shape `[M, K]`, matrix B has shape `[K, N]`, and the output matrix C has shape `[M, N]`. Each element `C[m, n]` of the output is the sum, accumulated along the K axis, of the products of row `m` of matrix A and column `n` of matrix B. In matrix multiplication, the `M direction` refers to the row direction of matrix C, the `N direction` refers to the column direction of matrix C, and the `K direction` refers to the inner (accumulation) dimension of the multiplication.

**Sample Specifications:**

This sample uses the parameters `M = 256, N = 256, K = 64`, with both input and output in `half` type and `ND` format.
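
As a host-side illustration of the formula above, the following minimal Python sketch accumulates each `C[m, n]` along the K axis. It is not the sample's `gen_data.py`; the function name `matmul_reference` and the nested-list representation are assumptions made for illustration, and plain Python floats are used instead of `half`.

```python
def matmul_reference(a, b):
    """Return C = A * B for row-major nested lists A [M][K] and B [K][N]."""
    m_size, k_size, n_size = len(a), len(b), len(b[0])
    if any(len(row) != k_size for row in a):
        raise ValueError("inner dimensions of A and B do not match")

    c = [[0.0] * n_size for _ in range(m_size)]
    for m in range(m_size):          # M direction: rows of C
        for n in range(n_size):      # N direction: columns of C
            acc = 0.0
            for k in range(k_size):  # K direction: accumulation axis
                acc += a[m][k] * b[k][n]
            c[m][n] = acc
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    print(matmul_reference(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

A reference of this kind is only practical for small shapes; for the sample's `256 x 256 x 64` case a vectorized library would normally be used instead.
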
The sample launches 2 cores to complete the computation. Each core is responsible for 128 rows along the M axis and all 256 columns along the N axis of the output matrix C:

- Core 0 computes rows `0~127` of matrix C.
- Core 1 computes rows `128~255` of matrix C.

The input and output specifications are shown in the following table:

| | name | shape | data type | format |
| --- | --- | --- | --- | --- |
| Sample Type (OpType) | Matmul | | | |
| Sample Input | A | [M, K] | half | ND |
| Sample Input | B | [K, N] | half | ND |
| Sample Output | C | [M, N] | half | ND |
| Kernel Function Name | mmad_custom | | | |

**Sample Implementation:**

**Kernel-side Overall Approach**

- `mmad_custom` is a `__global__ __cube__` kernel function, which indicates that it runs on the Cube computation unit of the AI Core, the unit used primarily for matrix computation.
- The sample uses the static Tensor programming method and creates `LocalTensor` objects through `LocalMemAllocator`.
- `CUBE_BLOCK = 16` indicates that the fractal for the half data type is `16 x 16`; the code performs `LoadData` transfers in units of `16 x 16` fractals.

**Kernel-side Detailed Process**

1. Create `GlobalTensor<half>` objects `aGM`, `bGM`, `cGM`, representing matrices A, B, C in GM (Global Memory).
2. Obtain the current core ID through `AscendC::GetBlockIdx()` and calculate `mIterIdx`.
   This sample only splits tasks along the M axis, so each core only needs to process its own M-axis slice of matrix A and matrix C.
3. Set GM address offsets:
   - `aGM` is offset by `mIterIdx * singleCoreM * K`, so that the current core reads its assigned row block of matrix A.
   - `bGM` has no offset, because each core needs to read the complete matrix B.
   - `cGM` is offset by `mIterIdx * singleCoreM * N`, so that the current core writes its results back to its assigned row block of matrix C.
4. Create static `LocalTensor` objects through `LocalMemAllocator`:
   - `a1Local`: temporary storage of matrix A in L1.
   - `a2Local`: temporary storage of matrix A in L0A, for `Mmad` to read.
   - `b1Local`: temporary storage of matrix B in L1. `a1Local` and `b1Local` use the same L1 allocator and are allocated in order, which avoids maintaining L1 address offsets by hand.
   - `b2Local`: temporary storage of matrix B in L0B, for `Mmad` to read.
   - `cLocal`: temporary storage of the matrix multiplication result in L0C.
5. Call `DataCopy` to transfer matrices A and B from GM to L1. Use `Nd2NzParams` to convert the input ND-format data to the Nz format required by Cube computation during the transfer.
6. Call `SetFlag<HardEvent::MTE2_MTE1>` and `WaitFlag<HardEvent::MTE2_MTE1>` for synchronization. `DataCopy` belongs to the MTE2 pipeline, and the subsequent `LoadData` belongs to the MTE1 pipeline. MTE1 must wait for MTE2 to complete so that it does not read L1 data that has not finished transferring.
7. Call `LoadData` to transfer matrix A from L1 to L0A, and matrix B from L1 to L0B. L0A and L0B are the input buffers that the Cube matrix computation unit reads directly.
8. Call `SetFlag<HardEvent::MTE1_M>` and `WaitFlag<HardEvent::MTE1_M>` for synchronization. `LoadData` belongs to the MTE1 pipeline, and the subsequent `Mmad` belongs to the M pipeline. The M pipeline must wait for MTE1 to complete so that it does not read L0A/L0B data that has not finished transferring.
9. Call `Mmad(cLocal, a2Local, b2Local, {baseM, baseN, baseK, 0, false, true})` to execute the matrix multiplication.
   Here `baseM = 128`, `baseN = 256`, `baseK = 64`, corresponding to the matrix block size that a single core computes at one time.
10. Call `SetFlag<HardEvent::M_FIX>` and `WaitFlag<HardEvent::M_FIX>` for synchronization. `Mmad` belongs to the M pipeline, and the subsequent `Fixpipe` belongs to the FIX pipeline. The FIX pipeline must wait for the M pipeline to complete so that it does not read L0C results that have not finished computing.
11. Call `Fixpipe` to convert the `float` accumulation result in L0C to `half` and transfer it back to the output location of matrix C in GM.
12. Finally, call `PipeBarrier<PIPE_ALL>()` to ensure that the related pipeline tasks within the current core have completed.

**Invocation Implementation**

Use the kernel invocation operator `<<<...>>>` to invoke the kernel function. When invoking, pass the matrix specifications, the single-core computation amount, and the basic tile size as template parameters, and pass the Device-side addresses of matrices A, B, and C as runtime parameters.

**API Parameter Description:**

The following structures all use curly braces `{}` for parameter passing.
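
The per-core GM offsets and the ND to Nz reordering described in the kernel-side process can be mimicked on the host as a sanity check. The following is a minimal Python sketch, not code from the sample: `core_gm_offsets` and `nd_to_nz` are hypothetical helper names, and the Nz ordering used here (column blocks of width C0 stored one after another, with rows contiguous inside each block) is an assumption inferred from the `Nd2NzParams` example `{1, baseM, baseK, 0, K, baseM, 1, 0}`. It also assumes the column count is a multiple of C0.

```python
CUBE_BLOCK = 16  # fractal edge length for half, as stated in the sample


def core_gm_offsets(block_idx, single_core_m, k, n):
    """Element offsets of the A, B and C slices for one core (M-axis split only)."""
    a_offset = block_idx * single_core_m * k  # row block of A
    b_offset = 0                              # every core reads all of B
    c_offset = block_idx * single_core_m * n  # row block of C
    return a_offset, b_offset, c_offset


def nd_to_nz(nd, rows, cols, c0=CUBE_BLOCK):
    """Reorder a flat row-major rows x cols buffer into the assumed Nz order.

    Assumed layout: the matrix is cut into column blocks of width c0; blocks
    are stored one after another, and inside a block each row contributes
    c0 contiguous elements.
    """
    if len(nd) != rows * cols:
        raise ValueError("buffer length does not match rows * cols")
    if cols % c0 != 0:
        raise ValueError("cols must be a multiple of c0 in this sketch")

    nz = [None] * (rows * cols)
    for r in range(rows):
        for c in range(cols):
            dst = (c // c0) * rows * c0 + r * c0 + (c % c0)
            nz[dst] = nd[r * cols + c]
    return nz


if __name__ == "__main__":
    # Core 1 with singleCoreM = 128, K = 64, N = 256 (the sample's sizes).
    print(core_gm_offsets(1, 128, 64, 256))  # (8192, 0, 32768)
    # Tiny 2 x 4 matrix with c0 = 2 to make the reordering visible.
    print(nd_to_nz(list(range(8)), rows=2, cols=4, c0=2))  # [0, 1, 4, 5, 2, 3, 6, 7]
```

In the tiny example the first column block (columns 0 and 1) is emitted row by row before the second block, which is the "N" ordering between blocks and the "z" ordering within a block.
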
The meaning of each field is as follows (the field order is consistent with the API documentation; actual struct declarations may order the fields differently):

`AscendC::Nd2NzParams`: used by the `DataCopy` interface; describes the ND→Nz format conversion parameters:

    struct Nd2NzParams {
        int32_t ndNum;              // Number of ND matrices to transfer, [0, 4095]
        uint16_t nValue;            // Number of rows in the ND matrix, [0, 16384]
        int32_t dValue;             // Number of columns in the ND matrix, [0, 65535]
        int32_t srcNdMatrixStride;  // Offset between starting addresses of adjacent ND matrices, unit: element, [0, 65535]
        int32_t srcDValue;          // Offset between adjacent rows of the same ND matrix, unit: element, [1, 65535]
        uint16_t dstNzC0Stride;     // Adjacent row offset after conversion of multiple rows from the same source row in destination Nz, unit: C0_SIZE (32B), [1, 16384]
        uint16_t dstNzNStride;      // Adjacent row offset of Z-shaped matrices in destination Nz, unit: C0_SIZE (32B), [1, 16384]
        int32_t dstNzMatrixStride;  // Starting address offset between adjacent Nz matrices in destination Nz, unit: element, [1, 65535]
    };

For example, when transferring matrix A, `{1, baseM, baseK, 0, K, baseM, 1, 0}` converts baseM×baseK ND data to Nz format.

`AscendC::LoadData2DParams`: used by the `LoadData` interface; describes the data transfer parameters for matrix A (L1 to L0A) and matrix B (L1 to L0B) on Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products:

    struct LoadData2DParams {
        int32_t startIndex;   // Fractal matrix ID (0 is the first), unit: 512B, [0, 65535]
        int32_t repeatTimes;  // Number of iterations, each iteration processes 512B, [1, 255]
        int32_t srcStride;    // Source fractal starting address interval between adjacent iterations, unit: 512B, [0, 65535]
        int32_t sid;          // Reserved, set to 0
        int32_t dstGap;       // Destination fractal interval between adjacent iterations, unit: 512B, [0, 65535]
        bool ifTranspose;     // Whether to transpose each fractal, default false
        bool addrMode;        // Address update mode, false: increment, true: decrement, default false
    };

For example: on Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products, the layout format on L0A is Zz. When transferring matrix A, use `{0, baseK / CUBE_BLOCK, baseM / CUBE_BLOCK, 0, 0, false, 0}`. When transferring matrix B, set `ifTranspose = true` to complete the Nz to Zn transpose transfer.

`AscendC::LoadData2DParamsV2`: used by the `LoadData` interface; describes the data transfer parameters for matrix A (L1 to L0A) and matrix B (L1 to L0B) on Ascend 950PR/Ascend 950DT products:

    struct LoadData2DParamsV2 {
        uint32_t mStartPosition;  // M direction start position, unit: 512B
        uint32_t kStartPosition;  // K direction start position, unit: 512B
        uint16_t mStep;           // Number of fractals transferred in M direction
        uint16_t kStep;           // Number of fractals transferred in K direction
        int32_t srcStride;        // Source fractal interval in K direction, unit: 512B
        uint16_t dstStride;       // Destination fractal interval in K direction, unit: 512B
        bool ifTranspose;         // Whether to transpose each fractal, default false
        uint8_t sid;              // Reserved, set to 0
    };

On Ascend 950PR/Ascend 950DT products, the layout format on L0A is Nz.
When transferring matrix A, use `{0, 0, baseM / CUBE_BLOCK, baseK / CUBE_BLOCK, baseM / CUBE_BLOCK, baseM / CUBE_BLOCK, false, 0}` to complete the Nz to Nz transfer of matrix A in one operation; when transferring matrix B, use `{0, 0, baseK / CUBE_BLOCK, baseN / CUBE_BLOCK, baseK / CUBE_BLOCK, baseN / CUBE_BLOCK, true, 0}` to complete the Nz to Zn transfer of matrix B in one operation.

`AscendC::MmadParams`: used by the `Mmad` interface; describes the matrix multiplication parameters:

    struct MmadParams {
        uint16_t m;           // Left matrix height (M dimension), [0, 4095]
        uint16_t n;           // Right matrix width (N dimension), [0, 4095]
        uint16_t k;           // Left matrix width / right matrix height (K dimension), [0, 4095]
        uint16_t unitFlag;    // Mmad and Fixpipe fine-grained parallelism control, default 0
        bool cmatrixSource;   // C matrix initial value source, false: CO1, true: C2, default false
        bool cmatrixInitVal;  // Whether the C matrix initial value is 0, default true
    };

For example, `{baseM, baseN, baseK, 0, false, true}` computes a baseM×baseN output block and accumulates over a length of baseK in the K direction.

`AscendC::FixpipeParamsV220`: used by the `Fixpipe` interface; describes the L0C to GM data transfer and precision conversion parameters:

    struct FixpipeParamsV220 {
        int32_t nSize;          // Source Nz matrix N direction size, [1, 4095]
        uint16_t mSize;         // Source Nz matrix M direction size (for Nz2ND, [1, 8192])
        uint16_t srcStride;     // Source Nz adjacent Z layout starting offset, unit: C0_SIZE, [0, 65535]
        int32_t dstStride;      // Number of elements per row in the destination ND matrix for Nz2ND, unit: element
        bool reluEn;            // Whether to enable ReLU
        QuantMode_t quantPre;   // Quantization mode, F322F16 means float→half
        uint64_t deqScalar;     // Scalar quantization parameter, single scale value
        int32_t ndNum;          // Number of source Nz matrices, [1, 65535]
        int32_t srcNdStride;    // Starting address interval between different Nz matrices, unit: 16×C0_SIZE, [1, 512]
        int32_t dstNdStride;    // Destination adjacent ND matrix offset, unit: element, [1, 65535]
        int32_t unitFlag;       // Mmad and Fixpipe parallelism control
    };

For example, `{baseN, baseM, baseM, N, false, F322F16, 0, 1, 0, 0, 0}` converts the baseM×baseN float32 result in L0C to half and writes it back to GM.

## Compilation and Execution

Execute the following steps in the root directory of this sample to compile and run it.

### Configure Environment Variables

Configure the environment variables according to how the CANN development kit package is installed in the current environment.

    source ${install_path}/cann/set_env.sh

Note: `${install_path}` is the CANN package installation directory. If no installation directory is specified, the package is installed to `/usr/local/Ascend` by default.

### Sample Execution

Execute the following commands in this sample directory.

    mkdir -p build && cd build;                              # Create and enter the build directory
    cmake -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;     # Build the project (default npu mode)
    python3 ../scripts/gen_data.py                           # Generate test input data
    ./demo                                                   # Run the compiled executable
    python3 ../scripts/verify_result.py output/output.bin output/golden.bin   # Verify that the output matches the golden data

When using CPU debug or NPU simulation mode, add the `-DCMAKE_ASC_RUN_MODE=cpu` or `-DCMAKE_ASC_RUN_MODE=sim` parameter.

Examples:

    cmake -DCMAKE_ASC_RUN_MODE=cpu -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;   # CPU debug mode
    cmake -DCMAKE_ASC_RUN_MODE=sim -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;   # NPU simulation mode

Note: Before switching compilation modes, clear the cmake cache.
Execute `rm CMakeCache.txt` in the build directory and run cmake again.

### Compilation Options Description

| Option | Available Values | Description |
| --- | --- | --- |
| `CMAKE_ASC_RUN_MODE` | `npu` (default), `cpu`, `sim` | Run mode: NPU execution, CPU debug, NPU simulation |
| `CMAKE_ASC_ARCHITECTURES` | `dav-2201` (default), `dav-3510` | NPU architecture: dav-2201 corresponds to Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products; dav-3510 corresponds to Ascend 950PR/Ascend 950DT |

### Execution Result

The following output indicates that the precision comparison succeeded.

    test pass!

## Functional Debugging

### printf

This interface provides formatted output for debugging in the CPU domain or NPU domain. Call `printf` in the operator's kernel-side implementation wherever log information needs to be output.

Example:

    AscendC::printf("matmul blockIdx=%d\n", AscendC::GetBlockIdx());

> [!CAUTION]
> Note: Printing through the printf (PRINTF) interface affects the runtime performance of the operator and is typically used only during the debugging phase. Developers can disable printing by setting `ASCENDC_DUMP=0` as needed.

### DumpTensor

For operators developed based on operator projects, you can use the DumpTensor interface to dump the content of a specified Tensor. It also supports printing custom additional information (only uint32_t data is supported), such as the current line number. Call `DumpTensor` in the operator's kernel-side implementation wherever Tensor data needs to be printed.

Example:

    AscendC::DumpTensor(cLocal, baseM * baseN);

> [!CAUTION]
> Note: Printing through the DumpTensor interface affects the runtime performance of the operator and is typically used only during the debugging phase.
> Developers can disable printing by setting `ASCENDC_DUMP=0` as needed.

## Performance Debugging

### msProf Tool Introduction

Use the `msprof` tool to obtain detailed performance data:

    msprof ./demo    # Analyze sample performance

A folder with the `PROF_` prefix is generated in the current directory. The `mindstudio_profiler_output` directory holds the performance data summary for the Host and each Device. To analyze performance data, view the files in this directory:

    PROF_xxxx_XXXXXX
    ├── device_{id}
    └── host
    └── mindstudio_profiler_log
    └── mindstudio_profiler_output    # Saves Host and each Device performance data summary
        ├── msprof_*.json
        ├── xx_*.csv
        └── README.txt

Authoring statement: parts of this article were generated with AI assistance (AIGC) and are for reference only.
