etcd High-Availability Cluster Deployment and Monitoring Configuration Guide
Published: 2026/6/21 12:46:45
Environment

| Item | Value |
|---|---|
| OS | Rocky Linux 9.7 |
| etcd version | v3.6.7 |
| Node plan | 3-node cluster (satisfies the Raft majority requirement; tolerates at most 1 node failure) |
| Zabbix Server | 192.168.44.135 (port 8080) |
| Prometheus | 192.168.44.135 (port 9090) |

1. Node plan

| Node name | IP address | Role |
|---|---|---|
| node1 | 192.168.44.132 | etcd member |
| node2 | 192.168.44.133 | etcd member |
| node3 | 192.168.44.134 | etcd member (initial Leader) |

2. Prerequisites

2.1 Time synchronization (all nodes)

The etcd cluster is strict about clock consistency: keep the time offset between nodes within 500 ms.

```bash
# Install and start chrony
yum install -y chrony
systemctl enable chronyd
systemctl start chronyd

# Enable NTP synchronization and check the current offset
timedatectl set-ntp true
chronyc tracking
```

2.2 Install dependencies (all nodes)

```bash
# Install tar (the Rocky Linux 9 minimal image does not ship it)
yum install -y tar
```

2.3 Prepare the installation package

Distribute the package over HTTP from the local Windows machine (192.168.44.1):

```bash
# On Windows: start a temporary HTTP server on the Downloads directory
python -m http.server 18888 --directory C:\Users\qiyongquan\Downloads

# On each Linux node: download the package
curl -o /tmp/etcd-v3.6.7-linux-amd64.tar.gz http://192.168.44.1:18888/etcd-v3.6.7-linux-amd64.tar.gz
```

3. Install etcd

Run the following steps on all three nodes. The configuration differs between nodes only in IP address and name.

3.1 Create the system user and directories

```bash
# Create a dedicated etcd system user (no login shell)
useradd -r -s /sbin/nologin etcd

# Create the configuration, data, and log directories
mkdir -p /etc/etcd /var/lib/etcd /var/log/etcd
```

3.2 Install the binaries

```bash
# Unpack the package (tar must be installed first)
cd /tmp
tar xzf etcd-v3.6.7-linux-amd64.tar.gz

# Copy the binaries into the system path
cp /tmp/etcd-v3.6.7-linux-amd64/etcd \
   /tmp/etcd-v3.6.7-linux-amd64/etcdctl \
   /tmp/etcd-v3.6.7-linux-amd64/etcdutl \
   /usr/local/bin/
chmod +x /usr/local/bin/etcd /usr/local/bin/etcdctl /usr/local/bin/etcdutl

# Verify the version
etcd --version
# etcd Version: 3.6.7
```

3.3 Create the configuration file

node1 (192.168.44.132). Comments sit on their own lines because a systemd EnvironmentFile does not treat a trailing `#` as a comment; it would become part of the value.

```bash
cat > /etc/etcd/etcd.conf << 'EOF'
# ── Identity ──────────────────────────────────────────────
ETCD_NAME=node1
ETCD_DATA_DIR=/var/lib/etcd

# ── Listen addresses ──────────────────────────────────────
ETCD_LISTEN_PEER_URLS=http://192.168.44.132:2380
ETCD_LISTEN_CLIENT_URLS=http://192.168.44.132:2379,http://127.0.0.1:2379

# ── Advertised addresses ──────────────────────────────────
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.44.132:2380
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.44.132:2379

# ── Cluster bootstrap ─────────────────────────────────────
ETCD_INITIAL_CLUSTER=node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380
ETCD_INITIAL_CLUSTER_TOKEN=etcd-cluster-prod
ETCD_INITIAL_CLUSTER_STATE=new

# ── Logging ───────────────────────────────────────────────
ETCD_LOGGER=zap
ETCD_LOG_OUTPUTS=/var/log/etcd/etcd.log
ETCD_LOG_LEVEL=info

# ── Metrics (Prometheus) ──────────────────────────────────
ETCD_METRICS=extensive
ETCD_LISTEN_METRICS_URLS=http://0.0.0.0:2381

# ── Performance and stability tuning ──────────────────────
# Auto-compaction: keep 1 hour of history
ETCD_AUTO_COMPACTION_RETENTION=1
# Create a snapshot every 5000 writes
ETCD_SNAPSHOT_COUNT=5000
# Backend storage quota: 8 GB
ETCD_QUOTA_BACKEND_BYTES=8589934592
# Heartbeat interval: 100 ms
ETCD_HEARTBEAT_INTERVAL=100
# Election timeout: 1000 ms
ETCD_ELECTION_TIMEOUT=1000
ETCD_MAX_SNAPSHOTS=5
ETCD_MAX_WALS=5
ETCD_ENABLE_V2=false
EOF
```

node2 (192.168.44.133): same as node1 except for the following fields:

```bash
ETCD_NAME=node2
ETCD_LISTEN_PEER_URLS=http://192.168.44.133:2380
ETCD_LISTEN_CLIENT_URLS=http://192.168.44.133:2379,http://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.44.133:2380
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.44.133:2379
```

node3 (192.168.44.134): same as node1 except for the following fields:

```bash
ETCD_NAME=node3
ETCD_LISTEN_PEER_URLS=http://192.168.44.134:2380
ETCD_LISTEN_CLIENT_URLS=http://192.168.44.134:2379,http://127.0.0.1:2379
ETCD_INITIAL_ADVERTISE_PEER_URLS=http://192.168.44.134:2380
ETCD_ADVERTISE_CLIENT_URLS=http://192.168.44.134:2379
```

3.4 Set directory ownership

```bash
chown -R etcd:etcd /etc/etcd /var/lib/etcd /var/log/etcd
```

3.5 Create the systemd service unit

```bash
cat > /etc/systemd/system/etcd.service << 'EOF'
[Unit]
Description=etcd Key-Value Store
Documentation=https://etcd.io/docs/
After=network.target

[Service]
Type=notify
User=etcd
EnvironmentFile=/etc/etcd/etcd.conf
ExecStart=/usr/local/bin/etcd
Restart=on-failure
RestartSec=5s
LimitNOFILE=65536
Nice=-10

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
```

3.6 Open the firewall ports

```bash
# Ports required by etcd
# 2379: client traffic
# 2380: peer (Raft) traffic between nodes
# 2381: Prometheus metrics scraping
firewall-cmd --permanent --add-port=2379-2381/tcp
firewall-cmd --reload
```

3.7 Start the three nodes together

⚠️ Important: when the cluster bootstraps, more than half of the nodes must be online at the same time before a Leader can be elected. Run the following commands on all three nodes at almost the same moment.

```bash
# Run on all three nodes at once (tmux synchronized panes work well)
systemctl start etcd

# Enable start on boot
systemctl enable etcd

# Check the service status
systemctl status etcd
```

4. Verify the etcd cluster

4.1 List cluster members

```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  member list -w table
```

Expected output (example):

```
+------------------+---------+-------+----------------------------+----------------------------+------------+
| ID               | STATUS  | NAME  | PEER ADDRS                 | CLIENT ADDRS               | IS LEARNER |
+------------------+---------+-------+----------------------------+----------------------------+------------+
| 445519fc8222d25a | started | node1 | http://192.168.44.132:2380 | http://192.168.44.132:2379 | false      |
| b65a5ac53d56bb92 | started | node2 | http://192.168.44.133:2380 | http://192.168.44.133:2379 | false      |
| 76efd20dda691bed | started | node3 | http://192.168.44.134:2380 | http://192.168.44.134:2379 | false      |
+------------------+---------+-------+----------------------------+----------------------------+------------+
```

4.2 Check endpoint status and the Leader

```bash
etcdctl \
  --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 \
  endpoint status -w table
```

4.3 Read/write test

```bash
# Write a test key
etcdctl --endpoints=http://192.168.44.132:2379 put /test/hello "world"

# Read it back
etcdctl --endpoints=http://192.168.44.132:2379 get /test/hello

# Delete the test data
etcdctl --endpoints=http://192.168.44.132:2379 del /test/hello
```

4.4 Check the listening ports

```bash
ss -tlnp | grep -E '2379|2380|2381'
# Expect three listening ports: 2379 (client), 2380 (peer), 2381 (metrics)
```

5. Prometheus monitoring configuration (192.168.44.135)

5.1 Add the etcd scrape job

Edit /opt/monitoring/prometheus/prometheus.yml and append the following to the scrape_configs section:

```yaml
- job_name: 'etcd'
  static_configs:
    - targets:
        - '192.168.44.132:2381'
        - '192.168.44.133:2381'
        - '192.168.44.134:2381'
  metrics_path: /metrics
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance
```

5.2 Create the etcd alert rules file

Create /opt/monitoring/prometheus/etcd_alert_rules.yml. The quota alerts multiply the ratio by 100, so their descriptions print the value directly as a percentage instead of passing it through humanizePercentage (which expects a 0 to 1 ratio). The gRPC failure-rate alert aggregates by grpc_service and grpc_method, so its summary uses those labels; the instance label does not survive that aggregation.

```yaml
groups:
  - name: etcd_alerts
    rules:
      - alert: EtcdNoLeader
        expr: etcd_server_has_leader == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Etcd cluster has no Leader (instance {{ $labels.instance }})"
          description: "The etcd cluster currently has no Leader and is not writable; an election problem or network partition may have occurred.\n  Current value: {{ $value }}\n  Labels: {{ $labels }}"

      - alert: EtcdHighNumberOfLeaderChangesCritical
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 3
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "Etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 3 etcd Leader changes in 10 minutes.\n  Current value: {{ $value }}"

      - alert: EtcdHighNumberOfLeaderChangesWarning
        expr: increase(etcd_server_leader_changes_seen_total[10m]) > 1
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Etcd instance {{ $labels.instance }} is changing Leader too often"
          description: "More than 1 etcd Leader change in 10 minutes.\n  Current value: {{ $value }}"

      - alert: EtcdHighNumberOfFailedProposals
        expr: increase(etcd_server_proposals_failed_total[1h]) > 5
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Etcd failed proposals too high (instance {{ $labels.instance }})"
          description: "More than 5 failed etcd proposals in the past hour.\n  Current value: {{ $value }}"

      - alert: EtcdBackendStorageQuotaExceed90Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Etcd backend storage quota usage too high (instance {{ $labels.instance }})"
          description: "The etcd backend database uses more than 90% of its storage quota.\n  Current usage: {{ printf \"%.1f\" $value }}%"

      - alert: EtcdBackendStorageQuotaExceed95Percent
        expr: (etcd_mvcc_db_total_size_in_bytes{job=~"etcd.*"} / etcd_server_quota_backend_bytes{job=~"etcd.*"}) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Etcd backend storage quota usage too high (instance {{ $labels.instance }})"
          description: "The etcd backend database uses more than 95% of its storage quota; writes are about to be blocked.\n  Current usage: {{ printf \"%.1f\" $value }}%"

      - alert: EtcdHighGrpcRequestFailureRate
        expr: sum(rate(grpc_server_handled_total{grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total[5m])) BY (grpc_service, grpc_method) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Etcd gRPC request failure rate is high ({{ $labels.grpc_service }}/{{ $labels.grpc_method }})"
          description: "Etcd gRPC request failure rate is above 1%.\n  Current failure ratio: {{ $value }}"

      - alert: EtcdHighGrpcRequestFailureCount
        expr: sum(increase(grpc_server_handled_total{job=~"etcd.*", grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) > 3
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Etcd gRPC request failure count is high"
          description: "More than 3 failed etcd gRPC requests; possible network jitter, high node load, or a Leader change."
```

5.3 Reference the rules file in prometheus.yml

```yaml
rule_files:
  - "etcd_alert_rules.yml"
  - "alert_rules.yml"
```

5.4 Restart Prometheus to apply the configuration

```bash
cd /opt/monitoring
docker compose restart prometheus

# Check the state of the etcd scrape targets
curl -s http://localhost:9090/api/v1/targets | python3 -c "
import sys, json
d = json.load(sys.stdin)
for t in d['data']['activeTargets']:
    if 'etcd' in t['labels'].get('job', ''):
        print(t['labels']['job'], t['labels']['instance'], t['health'])
"

# Expected output:
# etcd 192.168.44.132:2381 up
# etcd 192.168.44.133:2381 up
# etcd 192.168.44.134:2381 up
```

6. Zabbix monitoring configuration (192.168.44.135)

6.1 Configure monitoring through the Zabbix API

Get an API token:

```bash
TOKEN=$(curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"user.login","params":{"username":"Admin","password":"zabbix"},"id":1}' \
  | python3 -c "import sys,json; print(json.load(sys.stdin)['result'])")
echo "Token: $TOKEN"
```

Create the etcd monitoring template:

```bash
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{
    "jsonrpc":"2.0",
    "method":"template.create",
    "params":{
      "host":"Template etcd",
      "name":"Template etcd Service Monitor",
      "groups":[{"groupid":"1"}]
    },
    "id":2
  }'
# Note the returned templateid, for example 10772
```

Add the items. The health-check key follows the documented parameter order web.page.regexp[host,<path>,<port>,regexp,<length>,<output>], with the full URL in the host position and path and port left empty.

```bash
TEMPLATE_ID=10772

# Item 1: client port 2379
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd client port 2379 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2379]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":3}"

# Item 2: peer port 2380
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd peer port 2380 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2380]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":4}"

# Item 3: metrics port 2381
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd metrics port 2381 check\",
    \"key_\":\"net.tcp.port[{HOST.IP},2381]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"30s\"},\"id\":5}"

# Item 4: number of etcd processes
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd process count\",
    \"key_\":\"proc.num[etcd]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":3,\"delay\":\"60s\"},\"id\":6}"

# Item 5: etcd health endpoint
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"item.create\",\"params\":{
    \"name\":\"etcd health check\",
    \"key_\":\"web.page.regexp[http://{HOST.IP}:2381/health,,,\\\"\\\\{.*health.*\\\\}\\\"]\",
    \"hostid\":\"$TEMPLATE_ID\",\"type\":0,\"value_type\":1,\"delay\":\"30s\"},\"id\":7}"
```

Add the alert triggers (Zabbix 7.x expression syntax):

```bash
# Trigger 1: etcd client port unreachable (Disaster severity)
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd client port 2379 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2379])=0\",
    \"priority\":5},\"id\":8}"

# Trigger 2: etcd peer port unreachable
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd peer port 2380 unreachable on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/net.tcp.port[{HOST.IP},2380])=0\",
    \"priority\":5},\"id\":9}"

# Trigger 3: etcd process not running
curl -s http://localhost:8080/api_jsonrpc.php \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $TOKEN" \
  -d "{\"jsonrpc\":\"2.0\",\"method\":\"trigger.create\",\"params\":{
    \"description\":\"etcd process not running on {HOST.NAME}\",
    \"expression\":\"last(/Template etcd/proc.num[etcd])<1\",
    \"priority\":5},\"id\":10}"
```

Link the template to the three hosts. host.update replaces the host's full template list, so the already linked template 10343 is listed again alongside the new one.

```bash
for hostid in 10769 10770 10771; do
  curl -s http://localhost:8080/api_jsonrpc.php \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $TOKEN" \
    -d "{\"jsonrpc\":\"2.0\",\"method\":\"host.update\",\"params\":{
      \"hostid\":\"$hostid\",
      \"templates\":[{\"templateid\":\"10343\"},{\"templateid\":\"10772\"}]},\"id\":11}"
done
```

6.2 Verify in the Zabbix web interface

- Open http://192.168.44.135:8080 (account: Admin / zabbix).
- Go to Data collection → Templates (Configuration → Templates before Zabbix 6.4) and confirm that Template etcd Service Monitor exists.
- Go to Data collection → Hosts (Configuration → Hosts before Zabbix 6.4) and confirm that 132/133/134 are all linked to the etcd template.
- Go to Monitoring → Latest data, filter for the etcd items, and check that data is arriving.

7. etcd backup and restore

7.1 Manual backup

```bash
# Make sure the backup directory exists
mkdir -p /backups

# Take a snapshot (run on any one node)
etcdctl \
  --endpoints=http://192.168.44.132:2379 \
  snapshot save /backups/etcd-snapshot-$(date +%Y%m%d_%H%M%S).db

# Verify the snapshot
etcdutl snapshot status /backups/etcd-snapshot-*.db -w table
```

7.2 Scheduled backup (crontab)

```bash
# Back up every day at 01:00 and keep 7 days of snapshots
crontab -e
# Add this line (% must be escaped as \% inside a crontab entry):
0 1 * * * /usr/local/bin/etcdctl --endpoints=http://127.0.0.1:2379 snapshot save /backups/etcd-$(date +\%Y\%m\%d).db && find /backups -name "etcd-*.db" -mtime +7 -delete
```

7.3 Restore

```bash
# 1. Stop the etcd service (all nodes)
systemctl stop etcd

# 2. Move the current data aside (guards against mistakes)
mv /var/lib/etcd /var/lib/etcd.bak

# 3. Restore from the snapshot (every node uses the same snapshot file;
#    on node2 and node3, change --name and --initial-advertise-peer-urls to that node's values)
etcdutl snapshot restore /backups/etcd-20260329.db \
  --name node1 \
  --initial-cluster node1=http://192.168.44.132:2380,node2=http://192.168.44.133:2380,node3=http://192.168.44.134:2380 \
  --initial-cluster-token etcd-cluster-prod \
  --initial-advertise-peer-urls http://192.168.44.132:2380 \
  --data-dir /var/lib/etcd

# 4. Fix ownership
chown -R etcd:etcd /var/lib/etcd

# 5. Start etcd
systemctl start etcd
```

8. Common operations commands

```bash
# List cluster members
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 member list -w table

# Show the Leader and endpoint status
etcdctl --endpoints=http://192.168.44.132:2379,http://192.168.44.133:2379,http://192.168.44.134:2379 endpoint status -w table

# Check metrics (Prometheus format)
curl http://192.168.44.132:2381/metrics | grep etcd_server_has_leader

# Check health
curl http://192.168.44.132:2381/health

# View logs
journalctl -u etcd --no-pager -n 50
tail -f /var/log/etcd/etcd.log
```

9. Notes

- Node count: an odd number of nodes (3, 5, 7) is strongly recommended to avoid tied votes in elections. A 3-node cluster tolerates at most 1 node failure.
- Network connectivity: make sure ports 2379 (client) and 2380 (peer Raft) are reachable in both directions between nodes.
- Time synchronization: a time offset above 500 ms between nodes may cause frequent Leader changes, so NTP/chrony must be configured.
- etcd v3.6 API: the ETCDCTL_API=3 environment variable is not needed on v3.6; etcdctl has defaulted to the v3 API since v3.4.
- Zabbix 7.4 trigger syntax: use the last(/TemplateName/item.key[])=value form. The old {TemplateName:item.key[].last()}=value form, replaced in Zabbix 5.4, is no longer supported.
- Prometheus metrics port: in this setup etcd v3.6 serves metrics on port 2381, which requires ETCD_LISTEN_METRICS_URLS to be set explicitly in the configuration.

For an environment like this one (three-node bare-metal etcd with a standalone Prometheus), prefer dashboard 15308, which is the most complete. If the team is not comfortable with English dashboards, 23560 can be used alongside it.
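The node-count guidance in the notes (3 nodes tolerate 1 failure; odd sizes are preferred) comes down to Raft majority arithmetic. The following is a minimal sketch of that arithmetic only; the function names `quorum` and `fault_tolerance` are illustrative and are not part of etcd or etcdctl.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


if __name__ == "__main__":
    # Print the majority size and tolerated failures for cluster sizes 1..7.
    for n in range(1, 8):
        print(f"{n} members: quorum={quorum(n)}, "
              f"tolerates {fault_tolerance(n)} failure(s)")
```

Running it shows that going from 3 to 4 members raises the quorum from 2 to 3 without raising the number of tolerated failures, which is why an even-sized cluster adds cost but no extra resilience.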