CANN Scientific Model NPU Migration: Code Patterns
Published: 2026/7/4 8:17:02
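Reference: common code-level migration patterns; read as needed. This part has no main-workflow section (§) of its own. Use it together with part-04-code-migration.md §5.2 and §5.3; for commands, see part-07-commands.md.

[Free download link] cannbot-skills: CANNBot is a series of agents for CANN development that improve development efficiency; this repository provides reusable Skills modules for it. Project address: https://gitcode.com/cann/cannbot-skills

1. PyTorch: unified device abstraction (recommended)

Avoid scattering `.cuda()` / `.npu()` calls; centralize device selection in one place in the project:

```python
import torch


def _npu_available() -> bool:
    try:
        import torch_npu  # noqa: F401  # side effect: registers torch.npu backend
        return torch.npu.is_available()
    except ImportError:
        return False


def get_device(prefer: str = "auto") -> torch.device:
    """prefer: cpu | cuda | npu | auto (default auto: npu → cuda → cpu)."""
    prefer = prefer.lower()
    if prefer == "cpu":
        return torch.device("cpu")
    if prefer == "npu":
        return torch.device("npu:0") if _npu_available() else torch.device("cpu")
    if prefer == "cuda":
        return torch.device("cuda:0") if torch.cuda.is_available() else torch.device("cpu")
    if prefer != "auto":
        raise ValueError(f"unknown prefer={prefer!r}; use cpu | cuda | npu | auto")
    if _npu_available():
        return torch.device("npu:0")
    if torch.cuda.is_available():
        return torch.device("cuda:0")
    return torch.device("cpu")


# `model` and `tensor` come from the surrounding project code.
device = get_device("auto")  # default for NPU migration
# device = get_device("cuda")  # only for running a GPU baseline; never picks NPU by mistake
model.to(device)
tensor = tensor.to(device, non_blocking=False)
```

| prefer | Behavior |
| --- | --- |
| `cpu` | Always CPU |
| `npu` | Prefer NPU; fall back to CPU if unavailable (never selects CUDA) |
| `cuda` | Prefer CUDA; fall back to CPU if unavailable (never selects NPU; suitable for adding a GPU baseline) |
| `auto` (default) | NPU → CUDA → CPU, consistent with the main NPU migration path |

Record: in Mig_report §5.1, note whether `device_utils` was added and where it is called.

2. PyTorch: CUDA → NPU mapping table

| CUDA usage | Common NPU change | Notes |
| --- | --- | --- |
| `tensor.cuda()` | `tensor.npu()` or `.to("npu:0")` | Requires `import torch_npu` first |
| `torch.cuda.device(i)` | `torch.npu.device(i)` | |
| `torch.cuda.synchronize()` | `torch.npu.synchronize()` | Keeps performance profiling measurements consistent |
| `torch.cuda.amp.autocast` | `torch.npu.amp.autocast` | Follow the current torch_npu documentation |
| `GradScaler()` | `torch.npu.amp.GradScaler()` | Check whether it is enabled |
| `pin_memory=True` | Usually `False` | DataLoader |
| `backend="nccl"` | `backend="hccl"` | Distributed |
| `CUDA_VISIBLE_DEVICES` | `ASCEND_RT_VISIBLE_DEVICES` | Multi-card visibility |

Search command (scan before making changes):

```bash
rg -n '\.cuda\(|cuda:|torch\.cuda|CUDA_VISIBLE|nccl' --glob '*.py'
```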
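
The mapping table in section 2 can also be applied programmatically as a complement to the `rg` scan. The following is a minimal sketch and not part of the original skill: `CUDA_TO_NPU_HINTS` and `scan_source` are hypothetical names, and the patterns cover only rows of that table.

```python
import re

# Hypothetical hint list derived from the section 2 mapping table:
# (regex for the CUDA-side usage, suggested NPU-side change).
CUDA_TO_NPU_HINTS = [
    (r"\.cuda\(", ".npu() or .to('npu:0') (import torch_npu first)"),
    (r"torch\.cuda\.amp\.autocast", "torch.npu.amp.autocast"),
    (r"torch\.cuda\.synchronize", "torch.npu.synchronize"),
    (r"torch\.cuda\.device\(", "torch.npu.device("),
    (r"pin_memory\s*=\s*True", "pin_memory=False (usually)"),
    (r"""backend\s*=\s*["']nccl["']""", 'backend="hccl"'),
    (r"CUDA_VISIBLE_DEVICES", "ASCEND_RT_VISIBLE_DEVICES"),
]


def scan_source(source: str) -> list[tuple[int, str, str]]:
    """Return (line_number, matched_text, suggestion) for each CUDA-specific hit."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for pattern, suggestion in CUDA_TO_NPU_HINTS:
            match = re.search(pattern, line)
            if match:
                findings.append((lineno, match.group(0), suggestion))
    return findings


sample = '''x = x.cuda()
dist.init_process_group(backend="nccl")
loader = DataLoader(ds, pin_memory=True)
y = x + 1
'''
findings = scan_source(sample)
for lineno, matched, suggestion in findings:
    print(f"line {lineno}: {matched!r} -> {suggestion}")
```

The suggestions are only hints for a human reviewer; whether a given call site should change still depends on the current torch_npu documentation.
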
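
3. PyTorch: minimal changes to a single-card training loop

```python
import torch
import torch_npu  # noqa: F401  # side effect: must import before creating npu tensors

# `model`, `loader`, and `use_amp` are defined by the surrounding training script.
device = torch.device("npu:0")
model = model.to(device)
optimizer = torch.optim.Adam(model.parameters())
scaler = torch.npu.amp.GradScaler(enabled=use_amp)

for batch in loader:
    inputs = batch["image"].to(device, non_blocking=False)
    labels = batch["label"].to(device, non_blocking=False)
    optimizer.zero_grad(set_to_none=True)
    with torch.npu.amp.autocast(enabled=use_amp):
        loss = model(inputs, labels)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```

Smoke test: run 1 batch and check that `loss.item()` is finite with no NaN → record in Mig_report §6.

4. PyTorch: HCCL distributed initialization (sketch)

```python
import os

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  # side effect: registers torch.npu backend


def init_dist():
    # RANK and LOCAL_RANK are expected to be set by the launcher.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.npu.set_device(local_rank)
    dist.init_process_group(backend="hccl")
    return rank, local_rank
```

For launch commands, see the "Multi-card HCCL" section of part-07-commands.md.

5. Custom operators / CUDA extensions

Handle them in this order:

- Check whether the project or CANN already provides an Ascend operator or a `torch_npu` fused API.
- For small operators, fall back to CPU: inside `forward`, call `x.cpu()`, compute, then move the result back with `.to(device)`; note the performance impact.
- Replace the operator with an equivalent composition of `torch` operators.
- If it is still not feasible → record it in Mig_report §7 and route it back to part-02 / part-06.

```python
def safe_op(x):
    # `legacy_cpu_impl` is the project's existing CPU implementation of the operator.
    if x.device.type == "npu":
        return legacy_cpu_impl(x.cpu()).to(x.device)
    return legacy_cpu_impl(x)
```

6. MindSpore: context and entry point

```python
import mindspore as ms
from mindspore import context

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=0)


def train_step(data, label):
    # `network` is the model (with loss) defined by the surrounding project code.
    loss = network(data, label)
    return loss
```

For dynamic-graph debugging, temporarily switch to `PYNATIVE_MODE` for the smoke test, then switch back to `GRAPH_MODE` for performance evaluation; record the change in Mig_report §5.4.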
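
Both the PyTorch loop in section 3 and the MindSpore run in section 6 call for a smoke check on the loss. A framework-agnostic sketch of that check might look like the following; `check_smoke_loss` is a hypothetical helper name, and it assumes the loss has already been converted to a Python float (for example via `loss.item()` in PyTorch).

```python
import math


def check_smoke_loss(loss_value: float) -> str:
    """Return a one-line smoke-test verdict for a scalar loss.

    Raises ValueError when the loss is NaN or infinite, so a failed
    smoke run stops before any longer training is started.
    """
    if math.isnan(loss_value):
        raise ValueError("smoke test failed: loss is NaN")
    if math.isinf(loss_value):
        raise ValueError(f"smoke test failed: loss is infinite ({loss_value})")
    return f"smoke OK: loss={loss_value:.6f}"


print(check_smoke_loss(0.6931))  # a finite loss passes
```

The returned string is meant to be pasted into the report section that tracks smoke results; the exact wording is an illustration, not a format the report requires.
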
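
7. Preprocessing and Golden alignment

Before the Golden comparison, the following must match the baseline:

- resize / crop / normalize: same parameters and same implementation as the baseline (do not reorder the steps on the NPU side)
- NCHW vs NHWC layout, and the mean/std values
- Fixed `torch.manual_seed` / `numpy` seed
- Tolerance set according to the target precision (default FP16: rtol between 1e-2 and 1e-3, atol around 1e-3; do not copy an FP32-level atol of 1e-5) → see Compare §3.1 and part-07

Output comparison: record shape, max abs diff, mean abs diff, and the rtol/atol used → Compare §3.1, Mig_report §6.

Related index

- Main checklist: part-04-code-migration.md
- Commands: part-07-commands.md
- Troubleshooting: part-09-examples-troubleshooting.md

Authoring statement: parts of this article were generated with AI assistance (AIGC); for reference only.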
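
The output-comparison record described in section 7 (shape, max abs diff, mean abs diff, and the rtol/atol used) can be produced with a few lines of NumPy. This is a sketch under assumptions: `compare_with_golden` is a hypothetical helper, the default tolerances are simply picked from the FP16 range quoted in section 7, and the pass criterion follows `numpy.allclose` semantics (|actual - golden| <= atol + rtol * |golden|) rather than anything Compare §3.1 is known to prescribe.

```python
import numpy as np


def compare_with_golden(actual, golden, rtol=1e-2, atol=1e-3):
    """Summarize how closely `actual` (e.g. NPU output) matches `golden` (baseline)."""
    actual = np.asarray(actual, dtype=np.float64)
    golden = np.asarray(golden, dtype=np.float64)
    if actual.shape != golden.shape:
        raise ValueError(f"shape mismatch: {actual.shape} vs {golden.shape}")
    abs_diff = np.abs(actual - golden)
    return {
        "shape": actual.shape,
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "rtol": rtol,
        "atol": atol,
        "passed": bool(np.allclose(actual, golden, rtol=rtol, atol=atol)),
    }


rng = np.random.default_rng(0)  # fixed seed so the comparison is reproducible
golden = rng.standard_normal((2, 3)).astype(np.float32)
actual = golden + np.float32(5e-4)  # simulate a small FP16-level deviation
report = compare_with_golden(actual, golden)
print(report)
```

Framework tensors would need to be moved to host memory and converted to NumPy arrays before being passed in; how that is done depends on the framework in use.
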