# Counter and Buffering Constraints

Repository: CANN/cannbot-skills. CANNBot is a family of agents for CANN development that aims to improve developer efficiency; this repository provides reusable Skills modules for it. Project URL: https://gitcode.com/cann/cannbot-skills

Read this file when a kernel uses `DBuff`, `TBuff`, nested loops, delayed reuse, or autosync-sensitive slot ownership.

## Goal

Keep buffer lifetimes explicit enough that:

- ownership stays understandable
- slot lineage stays stable
- autosync grouping stays correct
- different stages do not silently fight over one counter

## 1. Core rule

Different buffer lifetimes must use different counters.

This is not an optional cleanup preference. It is a required authoring rule in this repository.

Symmetric rule:

- same-lifetime paired buffers may share one counter

## 2. What counts as the same lifetime

Buffers may share one counter when they are:

- loaded together
- consumed together
- retired together
- logically one stage pair

Typical example:

- `l1x` and `l1y` for the same `k` tile may share one `l1_cnt`

Buffers should not share a counter when one of them lives across a different loop boundary or stage boundary.

Typical example:

- `l1` streaming inside the `k` loop should not share the same counter with an outer `L0C` or vec postprocess stage

Concrete nested matmul example:

- if `l1x` / `l1y` advance on every `k` tile and are consumed as one operand pair, they may share one `l1_cnt`
- if `l0c` is owned by the outer `n` tile, keep its `l0c_cnt` separate from the `k`-loop `l1_cnt`
- reusing one counter across those different loop-owned lifetimes can blur autosync slot lineage and break the pipeline rhythm

## 3. Why this matters

Reusing one counter across different lifetimes can break reasoning in several ways:

- `DBuff` parity stops matching the real stage lifetime
- autosync grouping sees a misleading slot lineage
- nested loops become harder to reason about
- a later refactor silently changes which stage owns a slot

A kernel may still look fine while the lifetime model is already broken.

## 4. Preferred counter layout

Name counters by stage ownership, not by generic sequence order.
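As a rough illustration of stage-owned counters, the sketch below models the nested matmul case in plain Python. It does not use the repository's DSL: the function `l0c_slots`, the `share_counter` flag, and the `counter % 2` slot mapping are assumptions made only for this example. It shows that when the outer `L0C` tile borrows the `k`-loop counter, its double-buffer slot depends on how many `k` tiles happen to run.

```python
def l0c_slots(n_tiles, k_tiles, share_counter):
    """Return the double-buffer slot (counter % 2) picked for the L0C tile of each n iteration.

    Plain-Python model only; the names are hypothetical and not the DSL API.
    """
    l1_cnt = 0   # owned by the k loop (the l1x / l1y operand pair)
    l0c_cnt = 0  # owned by the outer n loop (the output tile)
    slots = []
    for _n in range(n_tiles):
        # The output tile picks its slot once per n iteration.
        owner = l1_cnt if share_counter else l0c_cnt
        slots.append(owner % 2)
        for _k in range(k_tiles):
            l1_cnt += 1  # the operand pair advances on every k tile
        l0c_cnt += 1     # the output tile advances once per n tile
    return slots


print(l0c_slots(4, 2, share_counter=False))  # [0, 1, 0, 1]
print(l0c_slots(4, 2, share_counter=True))   # [0, 0, 0, 0]
print(l0c_slots(4, 3, share_counter=True))   # [0, 1, 0, 1]
```

With a separate `l0c_cnt` the output slot alternates regardless of the `k` trip count. With a shared counter it alternates only when `k_tiles` is odd, which is one way a kernel can look fine while the lifetime model is already broken.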
Good examples:

- `l1_cnt`
- `l0c_cnt`
- `tile_cnt`
- `stage1_cnt`
- `stage2_cnt`

Bad pattern:

- one global counter reused everywhere because it was convenient once

## 5. Nested-loop rule

Across different loop levels, do not reuse one counter for different `DBuff` / `TBuff` lifetimes. If two loop levels each own their own buffering rhythm, give them separate counters.

This matters especially when:

- one loop streams operands
- a parent loop owns an output tile
- a delayed consumer runs one iteration later than the producer

## 6. Delayed-stage rule

If a stage consumes data one iteration later, treat that as a distinct lifetime unless the entire slot family truly remains one coherent delayed pipeline.

Stable pattern in this repository:

- one counter for immediate score/probability production
- a different counter for delayed `p @ value` accumulation

Do not hide a delayed lifetime behind the producer counter.

### Buffer depth must match the delayed overlap window

Do not assume "one-tile delay" means `DBuff` is enough.

Counterexample:

- tile `j` is produced into slot `0`
- tile `j+1` is produced into slot `1`
- the delayed stage is still consuming tile `j`
- the producer is already free to start tile `j+2`

If tile `j+2` also maps to slot `0`, a `DBuff` family can overwrite the still-live data for tile `j`.

Stable rule:

- when an operand family is reused by a delayed stage, size the local buffer family from the real producer/consumer overlap, not from the nominal delay count alone
- for the common "produce `j`, consume `j-1`, producer may begin `j+2` early" pattern, the reused operand family often needs `TBuff`

Concrete example:

- stage 1 uses `k_j` for `q @ k_j^T`
- the delayed stage later reuses that same `k_j` for `dqk_j @ k_j`
- `k` may need `TBuff` even though `v` can still remain `DBuff`

## 6a. Chunked vec stages usually need separate input and output counters

When one vec stage is split into row chunks such as `32 x 128` or `16 x 128`, do not assume one counter should drive every local family just because all chunks advance in the same loop.

Stable pattern:

- one counter for the `MTE2 -> V` input families such as `qkbuf` / `dpbuf`
- one counter for the `V -> MTE3` output families such as `pbuf` / `dqkbuf`

Even if both counters increment once per chunk today, keeping them separate is still the safer authoring rule because:

- the slot families belong to different autosync pairings
- later refactors often change one side's lifetime first
- reusing one counter can blur which family actually owns a problematic slot

Practical a2 lesson:

- `vec_in_cnt` and `vec_out_cnt` made the per-chunk `MTE2 -> V -> MTE3` story explicit
- trying to treat a live input family as scratch for output preparation made the lineage much harder to reason about

## 6b. Scratch tensors must shrink with the chunk, not just the loops

If you squeeze a stage from `32 x 128` down to `16 x 128`, update the scratch tensors and helper assumptions too.

Do not assume a larger scratch tensor can always be safely reused through a smaller sliced view.

Practical failure modes:

- simulator-v2 storage/view validation failure
- `sub_to_ub` size mismatches such as need 16384 bytes, have 8192 bytes
- hidden helper assumptions that still operate on the old full-chunk shape

Stable rule:

- shape dedicated scratch tensors from the current chunk size
- keep helper-local metadata tensors such as `quant_meta` / `quant_scale` on the same `[chunk_m, TILE_N]` shape as the real chunk
- revalidate UB usage after every chunk-size change; the safe chunk is an implementation result, not a constant you can cargo-cult

Practical a2 lesson:

- `32 x 128` chunking worked for the hif8 bring-up but left UB tight
- `16 x 128` plus dedicated chunk-sized quant scratch reduced pressure and kept the pipeline warning-free

## 7. Same-pipe grouping still matters

Counter correctness is easier to preserve when same-pipe instructions are grouped together.
Within one loop iteration, prefer:

- grouped MTE loads
- grouped compute
- grouped writeback

This makes ownership and stage boundaries much more obvious.

## 8. Mutex depth must match the real slot count

`CvMutex` / `VcMutex` `depth` is not a vague pipeline-tuning hint. In this repository it directly controls how many initial ready tokens the kernel gets for that intra-core handoff.

Concrete implementation fact:

- kernel build injects one initial `vec_ready` / `cube_ready` per unit of `depth`
- so `depth=N` means the producer may have up to `N` payloads in flight before the consumer has to return capacity

Stable rule:

- plain `Tensor` guarded by a mutex means `depth=1`
- `DBuff` may justify `depth=2`
- `TBuff` may justify `depth=3`
- `QBuff` may justify `depth=4`

Important boundary:

- those `2` / `3` / `4` values are upper bounds from the physical slot family, not defaults
- only use them when the producer/consumer really rotate across those distinct slots
- if the code keeps reusing one fixed view or one effective slot, `depth` must stay `1` even if the object type is `DBuff` / `TBuff` / `QBuff`

Why this matters:

- setting `depth=2` on a single-buffer `Tensor` tells the runtime there are two free slots
- the producer can then publish two in-flight payloads even though both writes land on the same storage
- that silently models overlap the kernel does not actually have and makes overwrite races look legal

Practical test:

- count the actual distinct storage slots that can hold independent unconsumed payloads
- set `depth` to that number, not to the overlap you wish you had

## 9. `TBuff` indexing rule in fused side-split kernels

In the Python DSL, the triple-buffer class is `TBuff`.
Its `__getitem__` already carries the buffer object and the raw counter to lowering.

For fused cube/vec kernels, prefer:

- `buf[stage_cnt]`

Stable pattern:

- keep the delayed-stage lineage in one counter such as `stage2_cnt`
- index `TBuff` directly with that counter
- let the buffer family, not the kernel body, own the `% 3` slot mapping

You can still spell an explicit modulo slot when you really need to reuse that slot value:

- `slot = var_mod(stage_cnt, 3)`
- `buf[slot]`

Current parser behavior:

- side pruning now removes dead tmp `var_mod` chains that no longer feed the retained side
- so inline `buf[var_mod(stage_cnt, 3)]` is no longer expected to fail by itself

Even so, direct indexing remains the preferred style because:

- the buffer family already knows it is triple-buffered
- `buf[stage_cnt]` keeps the lineage clearer
- it avoids redundant scalar instructions when the slot value is not reused anywhere else

## 10. Quick checklist

Before accepting the counter design, verify:

- each counter has one clear stage owner
- different loop-owned lifetimes do not share one counter
- same-lifetime pairs share only when that pairing is intentional
- delayed consumers use a counter that matches their own lifetime
- delayed reused operands have enough physical slots for the actual overlap window
- each `CvMutex` / `VcMutex` depth matches the real number of simultaneously reusable slots
- autosync grouping still matches the logical slot story

## Files to study

- `agent/example/kernels/a5/matmul_mknk_2dgrid_splitn.py`
- `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk.py`
- `agent/example/kernels/a5/matmul_mknk_2dgrid_splitk_add1.py`
- `agent/example/kernels/a5/matmul_rowwise_norm_large_nk.py`
- `agent/example/kernels/a5/vec_cube_vec_scale2_abs_add1_matmul.py`
- `agent/example/kernels/a5/test_mla_entire.py`
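The checklist item about delayed reused operands having enough physical slots can be sanity-checked on paper with a small model. The sketch below is plain Python, not the repository's DSL; `first_overwrite` and its `live_before_produce` parameter are hypothetical names introduced only here. It assumes tiles map to slots by `tile % depth` and that, at the moment tile `t` is produced, the previous `live_before_produce` tiles are still unconsumed.

```python
def first_overwrite(depth, n_tiles, live_before_produce):
    """Return (new_tile, clobbered_tile) for the first overwrite of live data, or None.

    Hypothetical model: tile t is written to slot t % depth, and tiles
    t-1 ... t-live_before_produce are still live when tile t is produced.
    """
    slot_owner = {}  # slot index -> tile currently stored there
    for t in range(n_tiles):
        slot = t % depth
        prev = slot_owner.get(slot)
        if prev is not None and t - prev <= live_before_produce:
            return (t, prev)  # tile t would land on a still-live tile
        slot_owner[slot] = t
    return None


# Producer may start tile j+2 while tile j is still being consumed:
# two earlier tiles are live, so two slots are not enough.
print(first_overwrite(depth=2, n_tiles=8, live_before_produce=2))  # (2, 0)
print(first_overwrite(depth=3, n_tiles=8, live_before_produce=2))  # None
# If only the previous tile is ever live, two slots suffice.
print(first_overwrite(depth=2, n_tiles=8, live_before_produce=1))  # None
```

Under these assumptions the minimum slot count is `live_before_produce + 1`, which matches the document's counterexample: a nominal one-tile delay with an early-starting producer needs three slots rather than two.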