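# CANN / Ascend NPU: Canary Releases for Inference Services, and How to Switch Versions Smoothly

A canary deployment (灰度发布) exposes a new version to about 5% of users first, then rolls it out to everyone once no problems show up. On an Ascend NPU inference service, a canary release has to solve three problems: traffic splitting, metric comparison, and fast rollback.

## Canary strategies

### Strategy 1: random split

```python
import random

# Share of traffic per version; the ratios should sum to 1.0.
MODEL_VERSIONS = {
    "v1.0": 0.95,  # 95% of traffic
    "v1.1": 0.05,  # 5% of traffic (canary)
}

def select_model():
    """Pick a version at random, weighted by MODEL_VERSIONS."""
    r = random.random()
    cumulative = 0
    for version, ratio in MODEL_VERSIONS.items():
        cumulative += ratio
        if r < cumulative:
            return version
    return "v1.0"  # fallback if rounding leaves a gap

# `app` (the web application object) and `models` (version -> loaded model)
# are assumed to be defined elsewhere.
@app.post("/generate")
async def generate(prompt: str):
    version = select_model()
    model = models[version]
    result = await model.generate(prompt)
    return {"result": result, "version": version}
```

### Strategy 2: split by user ID hash

```python
import hashlib

def select_model_by_user(user_id: str):
    # The same user always lands on the same version, so the experience stays consistent.
    # hashlib is used rather than the built-in hash(), which is salted per process
    # for strings and would move users between versions after a restart.
    hash_val = int(hashlib.md5(user_id.encode("utf-8")).hexdigest(), 16) % 100
    if hash_val < 5:  # 5% canary
        return "v1.1"
    return "v1.0"
```

### Strategy 3: split by request features

```python
def select_model_by_request(prompt: str):
    # Short prompts go to the new version, long prompts to the old one
    # (long-sequence support in the new version is not yet stable).
    if len(prompt) < 512:
        return "v1.1"
    return "v1.0"
```

## K8s canary configuration

```yaml
# 95% of traffic goes to v1.0
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-v1
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: inference-v1
                port:
                  number: 8000
---
# 5% of traffic goes to v1.1 (canary)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: inference-v1-canary
  annotations:
    # Annotation values must be strings, hence the quotes.
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "5"
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: inference-v1-canary
                port:
                  number: 8000
```

## Comparing results

During the canary period, compare the key metrics of the two versions:

```python
from prometheus_client import Counter, Histogram

# Define the metrics, each labelled by model version.
requests_total = Counter("inference_requests_total", "Total requests", ["version"])
latency_histogram = Histogram("inference_latency_seconds", "Latency", ["version"])
error_counter = Counter("inference_errors_total", "Errors", ["version"])

@app.post("/generate")
async def generate(prompt: str):
    version = select_model()
    # Count every request, including failed ones, so that
    # errors / requests is a true error rate.
    requests_total.labels(version=version).inc()
    with latency_histogram.labels(version=version).time():
        try:
            result = await models[version].generate(prompt)
            return result
        except Exception:
            error_counter.labels(version=version).inc()
            raise
```

Grafana panels for the comparison:

```yaml
panels:
  - title: Error rate comparison
    expr: |
      rate(inference_errors_total{version="v1.1"}[5m])
        / rate(inference_requests_total{version="v1.1"}[5m])
    type: graph
  - title: P99 latency comparison
    expr: |
      histogram_quantile(0.99,
        rate(inference_latency_seconds_bucket{version="v1.1"}[5m])
      )
    type: graph
```

Criteria for the canary to pass:

- Error rate no higher than 1.2× that of v1.0
- P99 latency no higher than 1.1× that of v1.0
- User feedback satisfaction ≥ v1.0

## Fast rollback

When the canary reveals a problem, roll back immediately:

```bash
# K8s: set the canary weight to 0
kubectl patch ingress inference-v1-canary \
  --type=json \
  -p='[{"op": "replace", "path": "/metadata/annotations/nginx.ingress.kubernetes.io~1canary-weight", "value": "0"}]'
```

ATB's model hot-swap can also be used for rollback:

```python
# The canary version failed: hot-swap back to the old version
model.reload("model_v1.0.om")
```

## Automating the canary

Use Prometheus and Alertmanager to automate the canary. The rule below compares error rates (errors divided by requests) rather than raw error counts, because the canary carries only 5% of the traffic. `sum()` drops the `version` label so that the two sides can be divided.

```yaml
# Prometheus alerting rule
groups:
  - name: canary_alerts
    rules:
      - alert: CanaryHighErrorRate
        expr: |
          (
            sum(rate(inference_errors_total{version="v1.1"}[5m]))
              / sum(rate(inference_requests_total{version="v1.1"}[5m]))
          )
          /
          (
            sum(rate(inference_errors_total{version="v1.0"}[5m]))
              / sum(rate(inference_requests_total{version="v1.0"}[5m]))
          ) > 1.2
        for: 2m
        annotations:
          summary: Canary error rate too high, rolling back automatically
```

```python
import os

# Alertmanager fires a webhook that calls the rollback API.
@app.post("/webhook/rollback")
async def rollback():
    # Set the K8s Ingress canary weight to 0.
    os.system("kubectl patch ingress ...")
    return {"status": "rolled back"}
```

Canary release is a well-established practice for updating online services: randomly route 5% of traffic to the new version, compare the results, and go to full rollout if nothing is wrong. K8s Ingress handles the traffic split, Prometheus monitors the results, and automatic rollback keeps an incident from spreading.

The repository is here: https://atomgit.com/cann/ATB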
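The three pass criteria for the canary (error rate within 1.2× of v1.0, P99 latency within 1.1× of v1.0, user satisfaction at least equal to v1.0) can be turned into a small gate that decides whether to promote the canary. The following is a minimal sketch, assuming the metrics have already been aggregated from monitoring. `VersionStats` and `canary_passes` are hypothetical names introduced here for illustration and are not part of ATB or CANN, and the sample numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    """Aggregated metrics for one model version (hypothetical container)."""
    error_rate: float      # errors / requests over the comparison window
    p99_latency_s: float   # P99 latency in seconds
    satisfaction: float    # user feedback score, higher is better


def canary_passes(baseline: VersionStats, canary: VersionStats,
                  max_error_ratio: float = 1.2,
                  max_latency_ratio: float = 1.1) -> bool:
    """Return True only if the canary meets all three pass criteria."""
    error_ok = canary.error_rate <= baseline.error_rate * max_error_ratio
    latency_ok = canary.p99_latency_s <= baseline.p99_latency_s * max_latency_ratio
    feedback_ok = canary.satisfaction >= baseline.satisfaction
    return error_ok and latency_ok and feedback_ok


# Illustrative values only, not measurements.
v10 = VersionStats(error_rate=0.010, p99_latency_s=0.80, satisfaction=4.2)
v11 = VersionStats(error_rate=0.011, p99_latency_s=0.85, satisfaction=4.3)
print(canary_passes(v10, v11))
```

Using a multiplicative threshold means that a baseline with a zero error rate rejects any canary that has errors at all. In practice a small absolute tolerance might be added for that case.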