# Matmul Basic API Computation Based on Static Tensor Programming

[Free download link] asc-devkit: This project is the operator development language that CANN provides for Ascend AI processors. It natively supports the C and C++ standards, consists mainly of class libraries and a language extension layer, and offers multi-level APIs to meet operator development needs across scenarios. Project address: https://gitcode.com/cann/asc-devkit

## Overview

This sample implements multi-core matrix multiplication based on the static Tensor programming paradigm.

## Supported Products and CANN Software Versions

| Product | CANN Software Version |
| --- | --- |
| Ascend 950PR/Ascend 950DT | CANN 9.1.0 |
| Atlas A3 Training Series Products/Atlas A3 Inference Series Products | CANN 9.0.0 |
| Atlas A2 Training Series Products/Atlas A2 Inference Series Products | CANN 9.0.0 |

## Directory Structure

```
├── matmul_basic_api
│   ├── scripts
│   │   ├── gen_data.py         // Script for generating input data and golden data
│   │   └── verify_result.py    // Golden value comparison script
│   ├── CMakeLists.txt          // Build project file
│   ├── data_utils.h            // Data read and write functions
│   ├── matmul_basic_api.asc    // Ascend C sample implementation and invocation
│   └── README.md               // Sample documentation
```

## Sample Description

**Sample Functionality:**

This sample uses the Ascend C basic API to implement a basic matrix multiplication (Matmul) kernel function. The matrix multiplication formula is as follows:

$$ C = A * B $$

where matrix A has shape `[M, K]`, matrix B has shape `[K, N]`, and the output matrix C has shape `[M, N]`. Each element `C[m, n]` of the output matrix is the sum, along the K axis, of the products of row `m` of matrix A and column `n` of matrix B. In matrix multiplication, the M direction refers to the row direction of matrix C, the N direction refers to the column direction of matrix C, and the K direction refers to the inner (accumulation) dimension of the multiplication.

**Sample Specifications:**

This sample uses `M = 256`, `N = 256`, `K = 64`, with both inputs and the output in `half` type and `ND` format.
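The accumulation along the K axis described above can be written out as a plain reference loop. This is a minimal host-side sketch in pure Python, not the kernel itself: the function name `matmul_reference` is hypothetical, and Python floats stand in for the float accumulation that the sample later converts to `half`.

```python
def matmul_reference(a, b):
    """Reference matmul: C[m][n] is the sum over k of A[m][k] * B[k][n]."""
    m_size, k_size = len(a), len(a[0])
    n_size = len(b[0])
    if len(b) != k_size:
        raise ValueError("inner dimensions of A and B must match")
    c = [[0.0] * n_size for _ in range(m_size)]
    for m in range(m_size):          # M direction: rows of C
        for n in range(n_size):      # N direction: columns of C
            acc = 0.0
            for k in range(k_size):  # K direction: accumulation
                acc += a[m][k] * b[k][n]
            c[m][n] = acc
    return c


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]   # shape [M, K] = [2, 2]
    b = [[5.0, 6.0], [7.0, 8.0]]   # shape [K, N] = [2, 2]
    print(matmul_reference(a, b))  # [[19.0, 22.0], [43.0, 50.0]]
```

With the sample's sizes, `a` would have 256 rows of 64 values and `b` would have 64 rows of 256 values; the loop structure stays the same.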
The sample launches 2 cores to complete the computation, with each core responsible for 128 rows along the M axis and all 256 columns along the N axis of the output matrix C:

- Core 0 computes rows `0~127` of matrix C.
- Core 1 computes rows `128~255` of matrix C.

The input and output specifications are shown in the following table:

- Sample Type (OpType): Matmul
- Kernel Function Name: `mmad_custom`

| | name | shape | data type | format |
| --- | --- | --- | --- | --- |
| Sample Input | A | [M, K] | half | ND |
| Sample Input | B | [K, N] | half | ND |
| Sample Output | C | [M, N] | half | ND |

**Sample Implementation:**

### Kernel-side Overall Approach

`mmad_custom` is a `__global__ __cube__` kernel function, which indicates that this function runs on the Cube computation unit of the AI Core and is primarily used for matrix computation.

The sample uses the static Tensor programming method and creates `LocalTensor` objects through `LocalMemAllocator`.

`CUBE_BLOCK = 16` indicates that the fractal for the half data type is `16 x 16`, and the code performs `LoadData` transfers in units of `16 x 16` fractals.

### Kernel-side Detailed Process

1. Create `GlobalTensor<half>` objects `aGM`, `bGM`, `cGM`, representing matrices A, B, C in GM (Global Memory).
2. Obtain the current core ID through `AscendC::GetBlockIdx()` and calculate `mIterIdx`. This sample only splits tasks along the M axis, so each core only needs to process its own M-axis slice of matrix A and matrix C.
3. Set GM address offsets:
   - `aGM` is offset by `mIterIdx * singleCoreM * K`, enabling the current core to read its assigned row block of matrix A.
   - `bGM` has no offset, as each core needs to read the complete matrix B.
   - `cGM` is offset by `mIterIdx * singleCoreM * N`, enabling the current core to write results back to its assigned row block of matrix C.
4. Create static `LocalTensor` objects through `LocalMemAllocator`:
   - `a1Local`: temporary storage of matrix A in L1.
   - `a2Local`: temporary storage of matrix A in L0A, for `Mmad` to read.
   - `b1Local`: temporary storage of matrix B in L1. `a1Local` and `b1Local` use the same L1 allocator and are allocated in order, avoiding manual L1 address offset maintenance.
   - `b2Local`: temporary storage of matrix B in L0B, for `Mmad` to read.
   - `cLocal`: temporary storage of the matrix multiplication result in L0C.
5. Call `DataCopy` to transfer matrices A and B from GM to L1. Use `Nd2NzParams` to convert the input ND format data to the Nz format required by Cube computation during the transfer.
6. Call `SetFlag<HardEvent::MTE2_MTE1>` and `WaitFlag<HardEvent::MTE2_MTE1>` for synchronization. `DataCopy` belongs to the MTE2 pipeline, and the subsequent `LoadData` belongs to the MTE1 pipeline. MTE1 must wait for MTE2 to complete, to avoid reading L1 data that has not finished transferring.
7. Call `LoadData` to transfer matrix A from L1 to L0A, and matrix B from L1 to L0B. L0A and L0B are the input buffers that the Cube matrix computation unit reads directly.
8. Call `SetFlag<HardEvent::MTE1_M>` and `WaitFlag<HardEvent::MTE1_M>` for synchronization. `LoadData` belongs to the MTE1 pipeline, and the subsequent `Mmad` belongs to the M pipeline. The M pipeline must wait for MTE1 to complete, to avoid reading L0A/L0B data that has not finished transferring.
9. Call `Mmad(cLocal, a2Local, b2Local, {baseM, baseN, baseK, 0, false, true})` to execute matrix multiplication. Here `baseM = 128`, `baseN = 256`, `baseK = 64`, corresponding to the matrix block size computed by a single core at one time.
10. Call `SetFlag<HardEvent::M_FIX>` and `WaitFlag<HardEvent::M_FIX>` for synchronization. `Mmad` belongs to the M pipeline, and the subsequent `Fixpipe` belongs to the FIX pipeline. The FIX pipeline must wait for the M pipeline to complete, to avoid reading L0C results that have not finished computing.
11. Call `Fixpipe` to convert the `float` accumulation result in L0C to `half` and transfer it back to the matrix C output location in GM.
12. Finally, call `PipeBarrier<PIPE_ALL>()` to ensure that the related pipeline tasks within the current core complete.

### Invocation Implementation

Use the kernel invocation operator `<<<>>>` to invoke the kernel function. When invoking, pass the matrix specifications, the single-core computation amount, and the basic tile size as template parameters, and pass the Device-side addresses of matrices A, B, and C as runtime parameters.

**API Parameter Description:**

The following structures all use curly braces `{}` for parameter passing.
The meaning of each field is as follows (field order is consistent with the API documentation; actual struct declarations may have different field orders):

`AscendC::Nd2NzParams`: used by the `DataCopy` interface; describes ND→Nz format conversion parameters:

```cpp
struct Nd2NzParams {
    int32_t  ndNum;              // Number of ND matrices to transfer, [0, 4095]
    uint16_t nValue;             // Number of rows in ND matrix, [0, 16384]
    int32_t  dValue;             // Number of columns in ND matrix, [0, 65535]
    int32_t  srcNdMatrixStride;  // Offset between starting addresses of adjacent ND matrices, unit: element, [0, 65535]
    int32_t  srcDValue;          // Offset between adjacent rows of the same ND matrix, unit: element, [1, 65535]
    uint16_t dstNzC0Stride;      // Adjacent row offset after conversion of multiple rows from the same source row in destination Nz, unit: C0_SIZE(32B), [1, 16384]
    uint16_t dstNzNStride;       // Adjacent row offset of Z-shaped matrices in destination Nz, unit: C0_SIZE(32B), [1, 16384]
    int32_t  dstNzMatrixStride;  // Starting address offset between adjacent Nz matrices in destination Nz, unit: element, [1, 65535]
};
```

For example, when transferring matrix A, `{1, baseM, baseK, 0, K, baseM, 1, 0}` converts baseM×baseK ND data to Nz format.

`AscendC::LoadData2DParams`: used by the `LoadData` interface; describes the data transfer parameters for matrix A (L1 to L0A) and matrix B (L1 to L0B) on Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products:

```cpp
struct LoadData2DParams {
    int32_t startIndex;   // Fractal matrix ID (0 is the first), unit: 512B, [0, 65535]
    int32_t repeatTimes;  // Number of iterations, each iteration processes 512B, [1, 255]
    int32_t srcStride;    // Source fractal starting address interval between adjacent iterations, unit: 512B, [0, 65535]
    int32_t sid;          // Reserved, set to 0
    int32_t dstGap;       // Destination fractal interval between adjacent iterations, unit: 512B, [0, 65535]
    bool    ifTranspose;  // Whether to transpose each fractal, default false
    bool    addrMode;     // Address update mode, false: increment, true: decrement, default false
};
```

For example, on Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products, the layout format on L0A is Zz. When transferring matrix A, use `{0, baseK / CUBE_BLOCK, baseM / CUBE_BLOCK, 0, 0, false, 0}`. When transferring matrix B, set `ifTranspose = true` to complete the Nz to Zn transpose transfer.

`AscendC::LoadData2DParamsV2`: used by the `LoadData` interface; describes the data transfer parameters for matrix A (L1 to L0A) and matrix B (L1 to L0B) on Ascend 950PR/Ascend 950DT products:

```cpp
struct LoadData2DParamsV2 {
    uint32_t mStartPosition;  // M direction start position, unit: 512B
    uint32_t kStartPosition;  // K direction start position, unit: 512B
    uint16_t mStep;           // Number of fractals transferred in M direction
    uint16_t kStep;           // Number of fractals transferred in K direction
    int32_t  srcStride;       // Source fractal interval in K direction, unit: 512B
    uint16_t dstStride;       // Destination fractal interval in K direction, unit: 512B
    bool     ifTranspose;     // Whether to transpose each fractal, default false
    uint8_t  sid;             // Reserved, set to 0
};
```

On Ascend 950PR/Ascend 950DT products, the layout format on L0A is Nz.
When transferring matrix A, use `{0, 0, baseM / CUBE_BLOCK, baseK / CUBE_BLOCK, baseM / CUBE_BLOCK, baseM / CUBE_BLOCK, false, 0}` to complete the A matrix Nz to Nz transfer in one operation; when transferring matrix B, use `{0, 0, baseK / CUBE_BLOCK, baseN / CUBE_BLOCK, baseK / CUBE_BLOCK, baseN / CUBE_BLOCK, true, 0}` to complete the B matrix Nz to Zn transfer in one operation.

`AscendC::MmadParams`: used by the `Mmad` interface; describes matrix multiplication parameters:

```cpp
struct MmadParams {
    uint16_t m;               // Left matrix Height (M dimension), [0, 4095]
    uint16_t n;               // Right matrix Width (N dimension), [0, 4095]
    uint16_t k;               // Left matrix Width/Right matrix Height (K dimension), [0, 4095]
    uint16_t unitFlag;        // Mmad and Fixpipe fine-grained parallelism control, default 0
    bool     cmatrixSource;   // C matrix initial value source, false: CO1, true: C2, default false
    bool     cmatrixInitVal;  // Whether C matrix initial value is 0, default true
};
```

For example, `{baseM, baseN, baseK, 0, false, true}` computes a baseM×baseN output block and accumulates over a length of baseK in the K direction.

`AscendC::FixpipeParamsV220`: used by the `Fixpipe` interface; describes L0C to GM data transfer and precision conversion parameters:

```cpp
struct FixpipeParamsV220 {
    int32_t     nSize;        // Source Nz matrix N direction size, [1, 4095]
    uint16_t    mSize;        // Source Nz matrix M direction size (for Nz2ND, [1, 8192])
    uint16_t    srcStride;    // Source Nz adjacent Z layout starting offset, unit: C0_SIZE, [0, 65535]
    int32_t     dstStride;    // Number of elements per row in destination ND matrix for Nz2ND, unit: element
    bool        reluEn;       // Whether to enable ReLU
    QuantMode_t quantPre;     // Quantization mode, F322F16 means float→half
    uint64_t    deqScalar;    // Scalar quantization parameter, single scale value
    int32_t     ndNum;        // Number of source Nz matrices, [1, 65535]
    int32_t     srcNdStride;  // Starting address interval between different Nz matrices, unit: 16×C0_SIZE, [1, 512]
    int32_t     dstNdStride;  // Destination adjacent ND matrix offset, unit: element, [1, 65535]
    int32_t     unitFlag;     // Mmad and Fixpipe parallelism control
};
```

For example, `{baseN, baseM, baseM, N, false, F322F16, 0, 1, 0, 0, 0}` converts the baseM×baseN float32 result in L0C to half and writes it back to GM.

## Compilation and Execution

Execute the following steps in the root directory of this sample to compile and run the sample.

### Configure Environment Variables

Configure environment variables according to how the CANN development kit package was installed in the current environment.

```bash
source ${install_path}/cann/set_env.sh
```

Note: `${install_path}` is the CANN package installation directory. If no installation directory is specified, it is installed to `/usr/local/Ascend` by default.

### Sample Execution

Execute the following commands in this sample directory.

```bash
mkdir -p build && cd build;                             # Create and enter build directory
cmake -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;    # Build project (default npu mode)
python3 ../scripts/gen_data.py                          # Generate test input data
./demo                                                  # Run the compiled executable
python3 ../scripts/verify_result.py output/output.bin output/golden.bin   # Verify that the output matches the golden data
```

When using CPU debug or NPU simulation mode, add the `-DCMAKE_ASC_RUN_MODE=cpu` or `-DCMAKE_ASC_RUN_MODE=sim` parameter.

Examples:

```bash
cmake -DCMAKE_ASC_RUN_MODE=cpu -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;   # CPU debug mode
cmake -DCMAKE_ASC_RUN_MODE=sim -DCMAKE_ASC_ARCHITECTURES=dav-2201 ..;make -j;   # NPU simulation mode
```

Note: Before switching compilation modes, clear the cmake cache.
Execute `rm CMakeCache.txt` in the build directory and run cmake again.

### Compilation Options Description

| Option | Available Values | Description |
| --- | --- | --- |
| `CMAKE_ASC_RUN_MODE` | `npu` (default), `cpu`, `sim` | Run mode: NPU execution, CPU debug, NPU simulation |
| `CMAKE_ASC_ARCHITECTURES` | `dav-2201` (default), `dav-3510` | NPU architecture: dav-2201 corresponds to Atlas A2 Training Series Products/Atlas A2 Inference Series Products and Atlas A3 Training Series Products/Atlas A3 Inference Series Products; dav-3510 corresponds to Ascend 950PR/Ascend 950DT |

### Execution Result

The following output indicates that the precision comparison succeeded.

```
test pass!
```

## Functional Debugging

### printf

This interface provides formatted output for debugging in the CPU domain or the NPU domain. Call the printf interface in the operator kernel-side implementation code wherever log information needs to be printed.

Example:

```cpp
AscendC::printf("matmul blockIdx=%d\n", AscendC::GetBlockIdx());
```

> [!CAUTION]
> Note: The printing functionality of the printf (PRINTF) interface affects the actual running performance of the operator and is typically used only during the debugging phase. Developers can disable printing by setting `ASCENDC_DUMP=0` as needed.

### DumpTensor

For operators developed based on operator projects, you can use the DumpTensor interface to dump the content of a specified Tensor. It also supports printing custom additional information (only uint32_t data is supported), such as the current line number. Call the DumpTensor interface in the operator kernel-side implementation code wherever Tensor data needs to be printed.

Example:

```cpp
AscendC::DumpTensor(cLocal, baseM * baseN);
```

> [!CAUTION]
> Note: The printing functionality of the DumpTensor interface affects the actual running performance of the operator and is typically used only during the debugging phase. Developers can disable printing by setting `ASCENDC_DUMP=0` as needed.

## Performance Debugging

### msProf Tool Introduction

Use the `msprof` tool to obtain detailed performance data:

```bash
msprof ./demo   # Analyze sample performance
```

A folder with the `PROF_` prefix is generated in the current directory. The `mindstudio_profiler_output` directory stores the performance data summary for the Host and each Device. For performance data analysis, view the files in this directory:

```
PROF_xxxx_XXXXXX
├── device_{id}
└── host
└── mindstudio_profiler_log
└── mindstudio_profiler_output    # Saves Host and each Device performance data summary
    ├── msprof_*.json
    ├── xx_*.csv
    └── README.txt
```

Creation statement: Parts of this article were generated with AI assistance (AIGC) and are for reference only.
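The tiling arithmetic from the Kernel-side Detailed Process and API Parameter Description sections (per-core GM offsets, fractal counts, and the element order after the ND→Nz copy) can be checked on the host with a small model. This is a sketch under stated assumptions, not part of the sample: the function names are hypothetical, `mIterIdx` is assumed to equal the block index because only the M axis is split, and the Nz mapping assumes the usual convention in which 16-element-wide column groups are stored one after another with rows contiguous inside each group.

```python
CUBE_BLOCK = 16  # a half fractal is 16 x 16 elements
HALF_BYTES = 2   # size of one half element in bytes


def core_offsets(block_idx, single_core_m, k, n):
    """Element offsets into A, B and C in GM for one core (M-axis split only)."""
    # Assumption: mIterIdx equals the block index because only M is split.
    m_iter_idx = block_idx
    return {
        "a": m_iter_idx * single_core_m * k,  # this core's row block of A
        "b": 0,                               # every core reads all of B
        "c": m_iter_idx * single_core_m * n,  # this core's row block of C
    }


def fractal_counts(base_m, base_n, base_k):
    """Number of 16 x 16 fractals along each axis of one tile."""
    return {
        "m": base_m // CUBE_BLOCK,
        "n": base_n // CUBE_BLOCK,
        "k": base_k // CUBE_BLOCK,
    }


def nd_to_nz_offset(row, col, rows):
    """Element offset in Nz layout of the ND element at (row, col).

    Assumed convention: column groups of CUBE_BLOCK elements are laid out
    one after another, and inside a group the rows are contiguous.
    `rows` is the tile height and must be a multiple of CUBE_BLOCK.
    """
    group = col // CUBE_BLOCK
    return group * rows * CUBE_BLOCK + row * CUBE_BLOCK + col % CUBE_BLOCK


if __name__ == "__main__":
    N, K = 256, 64
    single_core_m, base_m, base_n, base_k = 128, 128, 256, 64

    print(core_offsets(1, single_core_m, K, N))    # {'a': 8192, 'b': 0, 'c': 32768}
    print(fractal_counts(base_m, base_n, base_k))  # {'m': 8, 'n': 16, 'k': 4}
    print(CUBE_BLOCK * CUBE_BLOCK * HALF_BYTES)    # 512 bytes per half fractal
    print(nd_to_nz_offset(0, 16, base_m))          # 2048
```

The 512-byte fractal size matches the `512B` unit used by the `LoadData2DParams` fields, and the jump of `baseM * 16` elements between column groups corresponds to `dstNzC0Stride = baseM` in units of 32 bytes in the matrix A example.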
CANN/asc-devkit Matmul Basic API
Published: 2026/6/20 11:04:39