Best Practices for Integrating Kubernetes with Monitoring Systems
Published: 2026/5/21 12:51:27
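I. Introduction

Folks, let's skip the fancy stuff. Monitoring is a core part of any Kubernetes cluster, so today we go straight to the practical material: how to integrate the common monitoring systems into Kubernetes.

II. Types of monitoring systems

| Type | Use case | Strengths | Weaknesses |
|---|---|---|---|
| Prometheus | Metrics monitoring | Powerful query language | High storage cost |
| Grafana | Visualization | Rich set of charts | Complex configuration |
| Alertmanager | Alert management | Flexible alert routing rules | Steep learning curve |
| Loki | Log management | Lightweight | Limited feature set |
| Jaeger | Distributed tracing | Visualized call chains | High resource consumption |

III. Hands-on configuration

1. Prometheus configuration

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceAccountName: prometheus
  # Only ServiceMonitors carrying this label are scraped.
  serviceMonitorSelector:
    matchLabels:
      team: frontend
  resources:
    requests:
      memory: 400Mi
  # Only PrometheusRules carrying both labels are loaded.
  ruleSelector:
    matchLabels:
      prometheus: k8s
      role: alert-rules
  alerting:
    alertmanagers:
      # Service name and named port of the Alertmanager Service below.
      - namespace: monitoring
        name: alertmanager
        port: web
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 100Gi
```

2. Grafana configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
    spec:
      containers:
        - name: grafana
          image: grafana/grafana:latest
          ports:
            - containerPort: 3000
          env:
            # The admin password comes from a Secret, not from the manifest.
            - name: GF_SECURITY_ADMIN_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: grafana-secret
                  key: password
          volumeMounts:
            - name: grafana-storage
              mountPath: /var/lib/grafana
            - name: grafana-dashboards
              mountPath: /var/lib/grafana/dashboards
      volumes:
        - name: grafana-storage
          persistentVolumeClaim:
            claimName: grafana-storage
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: monitoring
spec:
  selector:
    app: grafana
  ports:
    - port: 3000
      targetPort: 3000
  type: LoadBalancer
```

3. Alertmanager configuration

The Alertmanager resource has no inline `config` field. Routing and receivers written in this camelCase form belong in an AlertmanagerConfig object, which the Alertmanager picks up through a label selector.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  serviceAccountName: alertmanager
  # Selects the AlertmanagerConfig below (the label name is an example).
  alertmanagerConfigSelector:
    matchLabels:
      alertmanagerConfig: main
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: standard
        resources:
          requests:
            storage: 10Gi
---
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: alertmanager-config
  namespace: monitoring
  labels:
    alertmanagerConfig: main
spec:
  route:
    groupBy: [alertname]
    groupWait: 30s
    groupInterval: 5m
    repeatInterval: 1h
    receiver: email
  receivers:
    - name: email
      emailConfigs:
        - to: susu@example.com
          from: alertmanager@example.com
          smarthost: smtp.example.com:587
          authUsername: alertmanager
          # Reference to a key in a Secret in the same namespace.
          authPassword:
            name: smtp-credentials
            key: password
---
apiVersion: v1
kind: Service
metadata:
  name: alertmanager
  namespace: monitoring
spec:
  # The operator labels Alertmanager pods with alertmanager: <resource name>.
  selector:
    alertmanager: alertmanager
  ports:
    # Named "web" so the Prometheus alerting section can reference it.
    - name: web
      port: 9093
      targetPort: 9093
```

4. Loki configuration

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: monitoring
spec:
  replicas: 1
  selector:
    matchLabels:
      app: loki
  template:
    metadata:
      labels:
        app: loki
    spec:
      containers:
        - name: loki
          image: grafana/loki:latest
          ports:
            - containerPort: 3100
          volumeMounts:
            # The ConfigMap replaces the image's /etc/loki directory, so it
            # must provide the config file that the container is started with.
            - name: loki-config
              mountPath: /etc/loki
            - name: loki-storage
              mountPath: /loki
      volumes:
        - name: loki-config
          configMap:
            name: loki-config
        - name: loki-storage
          persistentVolumeClaim:
            claimName: loki-storage
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: monitoring
spec:
  selector:
    app: loki
  ports:
    - port: 3100
      targetPort: 3100
```

IV. Optimizing the monitoring system

1. Metrics optimization

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
  namespace: monitoring
  labels:
    # Required so the Prometheus serviceMonitorSelector above picks it up.
    team: frontend
spec:
  selector:
    matchLabels:
      app: app
  endpoints:
    - port: metrics
      interval: 15s
      scrapeTimeout: 10s
      # Keep only the listed series and drop everything else at scrape time.
      metricRelabelings:
        - sourceLabels: [__name__]
          regex: "^(http_requests_total|http_request_duration_seconds_sum|http_request_duration_seconds_count)$"
          action: keep
```

2. Alert optimization

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: optimized-alerts
  namespace: monitoring
  labels:
    # Required so the Prometheus ruleSelector above loads these rules.
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: optimized
      rules:
        - alert: PodRestarting
          # increase() counts restarts over the window; rate() is per second,
          # so comparing it with 3 would practically never fire.
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: Pod Restarting Frequently
            description: "Pod {{ $labels.pod }} is restarting frequently"
        - alert: NodeDiskPressure
          expr: kube_node_status_condition{condition="DiskPressure",status="true"} == 1
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: Node Disk Pressure
            description: "Node {{ $labels.node }} has disk pressure"
```

3. Visualization optimization

Create a Grafana dashboard that contains the following panels:

- Cluster resource usage
- Pod status
- Network traffic analysis
- Storage usage
- Application performance metrics
- Alert status

V. Common problems

1. Monitoring data loss

Solutions:

- Configure persistent storage
- Tune the Prometheus storage parameters
- Put a data backup strategy in place

2. Alert storms

Solutions:

- Write sensible alert rules
- Use alert grouping
- Use alert inhibition

3. Monitoring performance problems

Solutions:

- Tune the metric scrape interval
- Filter metrics
- Give the monitoring system more resources

VI. Best practices summary

- Full-stack monitoring: cover the three layers of infrastructure, cluster, and application
- Alerting strategy: configure sensible alert rules and severity levels
- Visualization: build dashboards that are easy to read
- Performance optimization: size the monitoring resources sensibly
- Unified integration: use OpenTelemetry for full-stack observability
- Automation: keep monitoring configuration under version control

VII. Conclusion

Integrating Kubernetes with monitoring systems is an important way to keep a cluster running reliably. Follow the practices in this article and you can build a comprehensive, efficient monitoring system. Boom.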
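
The alert-storm advice above (grouping plus inhibition) is only stated as bullet points, so here is a minimal sketch of what it could look like. It is a fragment of a native Alertmanager configuration file (snake_case keys), not a complete file and not the operator CRD form. The `warning` severity level and the labels listed under `equal` are assumptions for illustration; the article itself only uses `critical`.

```yaml
# Fragment of alertmanager.yaml; the receivers section is omitted.
route:
  receiver: email
  # Grouping: alerts that share these labels are batched into one notification.
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h

inhibit_rules:
  # Inhibition: while a critical alert is firing, mute warning-level alerts
  # (assumed severity) that carry the same alertname and namespace.
  - source_matchers:
      - 'severity="critical"'
    target_matchers:
      - 'severity="warning"'
    equal: ['alertname', 'namespace']
```

Grouping cuts the number of notifications per incident, while inhibition hides lower-severity alerts that would only repeat what a firing critical alert already says. The `equal` list matters: without it, one critical alert anywhere would mute every warning in the cluster.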