CANN Operator Auto-Tuning: From Manual Search to Intelligent Optimization
Published: 2026/5/24 3:33:46
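1. Why auto-tuning is needed

1.1 The trouble with manual tuning

A freshly written operator rarely performs optimally. There are too many parameters to adjust:

- Tiling size: how large a block each dimension is split into
- Loop unrolling: how many times to unroll
- Vectorization: how many instructions run in parallel
- Memory layout: NHWC or NCHW

These parameters are also coupled: change the tiling size and the optimal unroll factor changes with it. Searching by hand is like feeling around in the dark, and it is extremely inefficient.

1.2 The idea behind auto-tuning

Auto-tuning is, at its core, a search for the best configuration in a parameter space:

1. Define the search space (which parameters, and over what ranges)
2. Define the evaluation function (run the configuration and measure its performance)
3. Use a search algorithm to find the best combination

The key is the choice of search algorithm. Exhaustive search is too slow and random search is too blind, while guided search (genetic algorithms, simulated annealing, Ansor) can find a near-optimal configuration within a limited number of trials.

2. Defining the search space

2.1 Tiling parameter space

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TilingSpace:
    """Tiling search space.

    Defines the candidate tile sizes for each dimension.

    Candidates are chosen based on:
    - L1/L2 cache sizes (to avoid cache misses)
    - Vector instruction width (e.g. a 128-bit register holds 4 float32
      values, a 512-bit one holds 16)
    - The number of Cube/Vector units on the NPU

    Why not a continuous range? Continuous search is too slow. In practice
    tile sizes are usually powers of 2 or multiples of 16, so only these
    meaningful values need to be searched.
    """
    tile_sizes: List[List[int]]    # candidate values for each dimension
    unroll_factors: List[int]      # candidate loop-unroll factors
    vectorize_options: List[bool]  # whether to enable vectorization

    def total_combinations(self) -> int:
        total = 1
        for sizes in self.tile_sizes:
            total *= len(sizes)
        total *= len(self.unroll_factors)
        total *= len(self.vectorize_options)
        return total


conv_space = TilingSpace(
    tile_sizes=[
        [16, 32, 64, 128],  # N (batch)
        [16, 32, 64],       # C (channel)
        [4, 8, 16, 32],     # H
        [4, 8, 16, 32],     # W
    ],
    unroll_factors=[1, 2, 4, 8],
    vectorize_options=[True, False],
)
# 4 * 3 * 4 * 4 * 4 * 2 = 1536
print(f"Search space size: {conv_space.total_combinations()} combinations")
```

2.2 Evaluation function

```python
import time

import numpy as np


def evaluate_tiling_config(op_type, input_shape, config, device="npu"):
    """Measure the performance of one tiling configuration.

    Method:
    - Run 10 warmup iterations to exclude cold-cache effects
    - Then run 100 iterations and take the median, which is more stable
      than the mean and filters out system noise

    build_op_with_config and get_peak_memory are project-specific helpers
    that are not shown in this article.
    """
    model = build_op_with_config(op_type, input_shape, config)
    model = model.to(device)
    # Assumption: the built operator accepts a NumPy array; convert this to
    # your framework's device tensor if it does not.
    input_data = np.random.rand(*input_shape).astype(np.float32)

    # Warmup
    for _ in range(10):
        model(input_data)

    # Timed runs. If the device executes asynchronously, synchronize it
    # before reading the clock, otherwise only launch time is measured.
    latencies = []
    for _ in range(100):
        start = time.perf_counter()
        model(input_data)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)

    latency_ms = float(np.median(latencies))
    memory_bytes = get_peak_memory(device)
    return latency_ms, memory_bytes
```

3. Genetic algorithm search

3.1 Why a genetic algorithm

- Discrete space: tiling parameters are discrete, so gradient descent does not apply
- Coupled parameters: the parameters depend on each other, which crossover handles naturally
- Parallel evaluation: the individuals of a generation can be evaluated in parallel (using multiple NPU cards)

3.2 Implementation

```python
import random
from copy import deepcopy


class GeneticTuning:
    """Genetic-algorithm tiling tuner.

    Flow:
    1. Randomly initialize a population of N individuals
    2. Evaluate each individual's fitness (the inverse of its latency)
    3. Select high-fitness individuals as parents
    4. Cross over to produce new individuals
    5. Mutate to introduce diversity
    6. Repeat until convergence or the maximum number of generations

    Tuning advice:
    - population_size=20, generations=30 covers most scenarios
    - mutation_rate=0.1: too high destroys good genes, too low gets stuck
      in a local optimum
    - crossover_rate=0.8: keep the crossover probability fairly high
    """

    def __init__(
        self,
        search_space: TilingSpace,
        population_size=20,
        generations=50,
        mutation_rate=0.1,
        crossover_rate=0.8,
    ):
        self.space = search_space
        self.pop_size = population_size
        self.generations = generations
        self.mutation_rate = mutation_rate
        self.crossover_rate = crossover_rate
        self.history = []

    def random_individual(self):
        return {
            "tile_sizes": [random.choice(s) for s in self.space.tile_sizes],
            "unroll_factor": random.choice(self.space.unroll_factors),
            "vectorize": random.choice(self.space.vectorize_options),
        }

    def fitness(self, individual, op_type, input_shape):
        latency, memory = evaluate_tiling_config(op_type, input_shape, individual)
        return 1.0 / (latency + 1e-6)

    def crossover(self, parent1, parent2):
        # Uniform crossover: each gene comes from either parent with p = 0.5.
        child = deepcopy(parent1)
        for i in range(len(child["tile_sizes"])):
            if random.random() < 0.5:
                child["tile_sizes"][i] = parent2["tile_sizes"][i]
        if random.random() < 0.5:
            child["unroll_factor"] = parent2["unroll_factor"]
        if random.random() < 0.5:
            child["vectorize"] = parent2["vectorize"]
        return child

    def mutate(self, individual):
        # Each gene is independently resampled with probability mutation_rate.
        mutant = deepcopy(individual)
        for i in range(len(mutant["tile_sizes"])):
            if random.random() < self.mutation_rate:
                mutant["tile_sizes"][i] = random.choice(self.space.tile_sizes[i])
        if random.random() < self.mutation_rate:
            mutant["unroll_factor"] = random.choice(self.space.unroll_factors)
        if random.random() < self.mutation_rate:
            mutant["vectorize"] = random.choice(self.space.vectorize_options)
        return mutant

    def tournament_select(self, population, fitnesses, k=3):
        # Pick k random individuals and return the fittest of them.
        selected = random.sample(list(zip(population, fitnesses)), k)
        return max(selected, key=lambda x: x[1])[0]

    def search(self, op_type, input_shape):
        population = [self.random_individual() for _ in range(self.pop_size)]
        best_config = None
        best_fitness = -1

        for gen in range(self.generations):
            fitnesses = [
                self.fitness(ind, op_type, input_shape) for ind in population
            ]
            gen_best_idx = max(range(len(fitnesses)), key=lambda i: fitnesses[i])
            if fitnesses[gen_best_idx] > best_fitness:
                best_fitness = fitnesses[gen_best_idx]
                best_config = deepcopy(population[gen_best_idx])

            avg_fitness = sum(fitnesses) / len(fitnesses)
            self.history.append({
                "generation": gen,
                "best_fitness": best_fitness,
                "avg_fitness": avg_fitness,
            })
            print(
                f"Gen {gen + 1}/{self.generations}: best={best_fitness:.6f}, "
                f"avg={avg_fitness:.6f}"
            )

            # Produce the next generation
            new_population = [deepcopy(population[gen_best_idx])]  # elitism
            while len(new_population) < self.pop_size:
                parent1 = self.tournament_select(population, fitnesses)
                parent2 = self.tournament_select(population, fitnesses)
                if random.random() < self.crossover_rate:
                    child = self.crossover(parent1, parent2)
                else:
                    child = deepcopy(parent1)
                child = self.mutate(child)
                new_population.append(child)
            population = new_population

        return best_config, best_fitness, self.history
```

4. Ansor-style auto-scheduling

4.1 Core idea

Ansor first generates a large number of candidate schedules, filters them quickly with a cost model, and measures only the promising ones on real hardware.

```python
class AnsorStyleTuner:
    """Ansor-style auto-tuner.

    How it differs from the genetic algorithm:
    - Genetic algorithm: approaches the optimum step by step through evolution
    - Ansor: generate many candidates -> pre-filter with a cost model ->
      measure only the top K

    Cost model: a simple heuristic score stands in for actually running
    each candidate (Ansor itself trains a learned cost model). It is
    imprecise, but it quickly rules out clearly bad candidates.
    """

    def __init__(self, n_candidates=1000, n_measured=50):
        self.n_candidates = n_candidates
        self.n_measured = n_measured

    def generate_candidates(self, compute_axis):
        candidates = []
        for _ in range(self.n_candidates):
            config = {
                "tile_sizes": [
                    random.choice([4, 8, 16, 32, 64])
                    for _ in range(len(compute_axis))
                ],
                "fuse": random.sample(
                    compute_axis, k=random.randint(0, len(compute_axis) - 1)
                ),
                "reorder": random.sample(compute_axis, k=len(compute_axis)),
                "unroll": random.choice([1, 2, 4, 8]),
            }
            candidates.append(config)
        return candidates

    def cost_model_predict(self, candidates):
        """Heuristic score; candidates with a higher score are measured first.

        Intended idea: moderate tiling plus moderate unrolling makes a good
        config. Note that this simple score actually rises with the
        tile-size product and falls as the unroll factor grows, so it is
        only a crude ranking.
        """
        predictions = []
        for config in candidates:
            tile_prod = 1
            for t in config["tile_sizes"]:
                tile_prod *= t
            unroll_bonus = 1.0 / (1 + config["unroll"] * 0.1)
            predictions.append(tile_prod * unroll_bonus)
        return predictions

    def tune(self, op_type, input_shape):
        candidates = self.generate_candidates(["N", "C", "H", "W"])
        predictions = self.cost_model_predict(candidates)
        top_k_indices = sorted(
            range(len(predictions)), key=lambda i: predictions[i], reverse=True
        )[: self.n_measured]

        results = []
        for idx in top_k_indices:
            latency, memory = evaluate_tiling_config(
                op_type, input_shape, candidates[idx]
            )
            results.append({"config": candidates[idx], "latency": latency})

        best = min(results, key=lambda r: r["latency"])
        return best["config"], best["latency"]
```

5. The tuning workflow in practice

```python
def auto_tune_operator(op_type, input_shape, method="genetic", device="npu"):
    """Entry point for auto-tuning.

    Args:
        method: "genetic" (genetic algorithm), "ansor" (Ansor style),
            or "random" (random-search baseline)
    """
    print(f"Auto-tuning {op_type}, shape={input_shape}, method={method}")

    if method == "genetic":
        tuner = GeneticTuning(
            search_space=conv_space, population_size=20, generations=30
        )
        best_config, best_fitness, history = tuner.search(op_type, input_shape)
        # Invert fitness = 1 / (latency + 1e-6)
        best_latency = 1.0 / best_fitness - 1e-6
    elif method == "ansor":
        tuner = AnsorStyleTuner(n_candidates=500, n_measured=30)
        best_config, best_latency = tuner.tune(op_type, input_shape)
    elif method == "random":
        # TilingSpace has no sampler of its own, so reuse GeneticTuning's.
        sampler = GeneticTuning(search_space=conv_space)
        best_latency = float("inf")
        best_config = None
        for _ in range(50):
            config = sampler.random_individual()
            latency, _mem = evaluate_tiling_config(
                op_type, input_shape, config, device
            )
            if latency < best_latency:
                best_latency = latency
                best_config = config
    else:
        raise ValueError(f"Unknown tuning method: {method}")

    print(f"Best latency: {best_latency:.2f} ms")
    return best_config, best_latency
```

6. Visualizing tuning results

```python
import matplotlib.pyplot as plt


def plot_search_history(history):
    gens = [h["generation"] for h in history]
    best = [h["best_fitness"] for h in history]
    avg = [h["avg_fitness"] for h in history]

    plt.figure(figsize=(10, 6))
    plt.plot(gens, best, "b-", linewidth=2, label="Global Best")
    plt.plot(gens, avg, "r:", alpha=0.5, label="Average")
    plt.xlabel("Generation")
    plt.ylabel("Fitness (1/Latency)")
    plt.title("Genetic Algorithm Search History")
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.savefig("tuning_history.png", dpi=150, bbox_inches="tight")
    plt.show()
```

7. Common problems

| Problem | Cause | Solution |
| --- | --- | --- |
| Search is too slow | Population too large, or evaluation function too heavy | Shrink the population; pre-filter with a cost model |
| Results are unstable | Different random seeds | Run several searches and keep the best; fix the random seed |
| Overfitting to a specific input | Tuned for only one shape | Cross-validate with several input shapes |
| Out of memory | Tiling configuration too large | Add a memory constraint and discard configurations that exceed it |

Related repositories

- TVM - auto-scheduling framework: https://github.com/apache/tvm
- Ansor - auto-scheduling paper: https://arxiv.org/abs/2006.06799
- CANN - Ascend compute architecture: https://gitee.com/ascend/cann
- keras-tuner - hyperparameter tuning framework: https://github.com/keras-team/keras-tuner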
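
Two fixes from the common-problems table, fixing the random seed and discarding configurations that exceed a memory limit, can be sketched in a few lines. This is only an illustration under stated assumptions: `MEMORY_BUDGET_BYTES`, `BYTES_PER_ELEMENT`, `estimate_tile_bytes` and `sample_feasible_configs` are hypothetical names that do not appear in the article, and the footprint estimate (tile element count times element size) is a crude stand-in for whatever the real operator allocates.

```python
import random
from math import prod

# Hypothetical budget for one tile's working set; not a real device limit.
MEMORY_BUDGET_BYTES = 256 * 1024
BYTES_PER_ELEMENT = 4  # assumes float32 data

# Same per-dimension candidates as the article's conv_space (N, C, H, W).
TILE_CANDIDATES = [
    [16, 32, 64, 128],
    [16, 32, 64],
    [4, 8, 16, 32],
    [4, 8, 16, 32],
]


def estimate_tile_bytes(tile_sizes):
    """Crude footprint estimate: number of tile elements times element size."""
    return prod(tile_sizes) * BYTES_PER_ELEMENT


def sample_feasible_configs(n_samples, seed=0):
    """Sample tile configs with a fixed seed, dropping those over budget."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    feasible = []
    for _ in range(n_samples):
        tiles = [rng.choice(candidates) for candidates in TILE_CANDIDATES]
        if estimate_tile_bytes(tiles) <= MEMORY_BUDGET_BYTES:
            feasible.append(tiles)
    return feasible


configs_a = sample_feasible_configs(200, seed=42)
configs_b = sample_feasible_configs(200, seed=42)
print(f"{len(configs_a)} of 200 samples fit the budget")
print("same seed gives same configs:", configs_a == configs_b)
```

Using a local `random.Random(seed)` rather than seeding the global generator keeps each search reproducible without affecting other code. Filtering before measurement also means over-budget configurations never reach the evaluation function.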