What Are the 'Actor' and the 'Critic' in DDPG Actually Arguing About? A Line-by-Line Walkthrough of the Training Process in Python
Published: 2026/5/30 19:42:20
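Imagine you are directing an improvised play with no script. The Actor has to improvise on stage while the Critic comments in real time from the audience. What makes this play special is that the Actor's movements can be precise down to a millimeter-level change in angle, and the Critic's scoring standard keeps shifting too. This is the central dramatic conflict of DDPG (Deep Deterministic Policy Gradient). Let's use PyTorch code as the stage and look at the technical machinery behind the performance.

1. Setting the stage: initializing DDPG's four roles

Every good play needs a carefully built stage. In the DDPG universe, we first need to prepare four neural networks:

```python
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.layer_1 = nn.Linear(state_dim, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = torch.relu(self.layer_1(state))
        x = torch.relu(self.layer_2(x))
        # tanh squashes to [-1, 1]; scaling maps it to [-max_action, max_action]
        return self.max_action * torch.tanh(self.layer_3(x))


class Critic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        # The first layer takes the state and the action concatenated together
        self.layer_1 = nn.Linear(state_dim + action_dim, 400)
        self.layer_2 = nn.Linear(400, 300)
        self.layer_3 = nn.Linear(300, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], dim=1)
        x = torch.relu(self.layer_1(x))
        x = torch.relu(self.layer_2(x))
        return self.layer_3(x)  # one scalar Q-value per (state, action) pair
```

Two design decisions are worth noting here:

- The Actor's output layer uses tanh to keep actions within [-max_action, max_action].
- The Critic takes the state and the action concatenated together. This is the typical design for a Q-function, which scores the value of a (state, action) pair.

Initializing the four roles is like assembling a theater troupe:

```python
# state_dim, action_dim and max_action come from the environment

# Lead actor and lead critic
actor = Actor(state_dim, action_dim, max_action)
critic = Critic(state_dim, action_dim)

# Understudy actor and understudy critic (the target networks)
target_actor = Actor(state_dim, action_dim, max_action)
target_critic = Critic(state_dim, action_dim)

# At the start, the target networks share the main networks' parameters
target_actor.load_state_dict(actor.state_dict())
target_critic.load_state_dict(critic.state_dict())
```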
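As a quick sanity check on the two properties described above (bounded actions, and target networks that start as exact copies), here is a minimal sketch. It uses a tiny stand-in network, `TinyActor`, which is hypothetical and much smaller than the article's 400/300-unit Actor, together with assumed Pendulum-like dimensions, so that it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyActor(nn.Module):
    """Hypothetical stand-in for the Actor: same tanh scaling, smaller layers."""

    def __init__(self, state_dim, action_dim, max_action):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 16), nn.ReLU(), nn.Linear(16, action_dim)
        )
        self.max_action = max_action

    def forward(self, state):
        return self.max_action * torch.tanh(self.net(state))


# Assumed dimensions (Pendulum-like: 3 observations, 1 torque in [-2, 2]).
state_dim, action_dim, max_action = 3, 1, 2.0

actor = TinyActor(state_dim, action_dim, max_action)
target_actor = TinyActor(state_dim, action_dim, max_action)
target_actor.load_state_dict(actor.state_dict())

# Even for extreme inputs, tanh keeps every action inside the bound.
states = 1000.0 * torch.randn(64, state_dim)
with torch.no_grad():
    actions = actor(states)
    target_actions = target_actor(states)

actions_in_bounds = bool((actions.abs() <= max_action).all())
targets_match = bool(torch.equal(actions, target_actions))
print(actions.shape, actions_in_bounds, targets_match)
```

Because the copy is exact, the main and target actors agree on every input until the first gradient step moves the main network away.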
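2. Rehearsal: the dynamic game inside the training loop

The real dramatic conflict happens in the training loop. Let's break down one complete training step.

2.1 Collecting experience

```python
def select_action(state, noise):
    state = torch.FloatTensor(state.reshape(1, -1))
    with torch.no_grad():  # acting needs no gradients
        action = actor(state).numpy().flatten()
    # Add exploration noise, then clip back into the valid action range
    return np.clip(action + noise, -max_action, max_action)


# Take the action in the environment and store the experience
action = select_action(state, noise)
next_state, reward, done, _ = env.step(action)
replay_buffer.add(state, action, reward, next_state, done)
```

The exploration noise introduced here is like the actor's improvisation: adding randomness to a deterministic policy keeps the performance from becoming rigid. A common choice is Ornstein-Uhlenbeck noise, which produces a temporally correlated random process and suits continuous control of physical systems.

2.2 The Critic's learning moment

After sampling from the replay buffer, the Critic starts its sharp-tongued review:

```python
# Compute the target Q-values
target_actions = target_actor(next_states)
target_q_values = target_critic(next_states, target_actions)
targets = rewards + (1 - dones) * gamma * target_q_values

# Compute the current Q-value estimate
current_q_values = critic(states, actions)

# Critic loss
critic_loss = nn.MSELoss()(current_q_values, targets.detach())
```

The Critic update has three key points:

- target_q_values is computed with the target networks, which keeps the regression target stable.
- targets.detach() cuts the gradient flow, so the loss never backpropagates into the target networks.
- The (1 - dones) term handles episode termination: when an episode ends, no future value is bootstrapped.

2.3 The Actor's self-cultivation

The Actor update is more interesting, because the Actor is trying to please the Critic:

```python
actor_loss = -critic(states, actor(states)).mean()
```

This simple expression contains the deterministic policy gradient:

- The Critic evaluates how well the Actor's current policy performs.
- The minus sign means we want to maximize that evaluation.
- Gradient ascent is thereby turned into minimizing a loss function.

2.4 A gentle update: the soft synchronization mechanism

The most elegant part of DDPG's design is how the target networks are updated:

```python
def soft_update(target, source, tau):
    for target_param, param in zip(target.parameters(), source.parameters()):
        # target <- tau * source + (1 - tau) * target
        target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)


# Update the target networks
soft_update(target_actor, actor, tau)
soft_update(target_critic, critic, tau)
```

This Polyak averaging scheme (tau is usually set to 0.005) is like a veteran actor slowly absorbing a newcomer's style, which avoids an abrupt change of style that would startle the audience.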
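To get a feel for how slowly the target tracks the main network: if the main network's parameters stayed fixed, then after n soft updates the remaining gap would be (1 - tau)^n of the original gap. The sketch below checks this on two small `nn.Linear` layers, which are stand-ins rather than the article's networks. It repeats `soft_update` (written with in-place ops under `torch.no_grad()` instead of `.data`) so that the block runs on its own.

```python
import torch
import torch.nn as nn


def soft_update(target, source, tau):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    with torch.no_grad():
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * param)


torch.manual_seed(0)
source = nn.Linear(4, 2)  # stand-in for the main network (held fixed here)
target = nn.Linear(4, 2)  # stand-in for the target network

tau, n_updates = 0.005, 500
initial_gap = (source.weight - target.weight).detach().clone()

for _ in range(n_updates):
    soft_update(target, source, tau)

remaining_gap = (source.weight - target.weight).detach()
expected_fraction = (1.0 - tau) ** n_updates  # the gap shrinks geometrically
print(f"fraction of gap remaining after {n_updates} updates: {expected_fraction:.3f}")
```

With tau = 0.005, roughly 8% of the original gap remains after 500 updates. In real training the main network keeps moving, so the target is always chasing a slightly older version of it, which is exactly the stabilizing lag the scheme is meant to provide.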
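3. Behind the scenes: key tricks and debugging experience

In a real production, a few backstage tricks decide whether the show succeeds.

3.1 The secret recipe of experience replay

```python
class ReplayBuffer:
    def __init__(self, max_size):
        self.buffer = []
        self.max_size = max_size

    def __len__(self):
        # Needed because the training loop calls len(replay_buffer)
        return len(self.buffer)

    def add(self, state, action, reward, next_state, done):
        # Drop the oldest transition once the buffer is full
        if len(self.buffer) >= self.max_size:
            self.buffer.pop(0)
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling (with replacement) breaks temporal correlation
        indices = np.random.choice(len(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(
            *[self.buffer[i] for i in indices]
        )
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones))
```

Two key parameters of experience replay:

- Buffer size: usually 1e5 to 1e6. Too small and the samples are highly correlated; too large and learning slows down.
- Batch size: 128 is a common starting point; complex tasks may need a larger batch.

3.2 The dance of learning rates

The Actor and the Critic usually need different learning tempos:

```python
actor_optimizer = optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = optim.Adam(critic.parameters(), lr=1e-3)
```

Typical configuration:

- The Critic's learning rate is 5 to 10 times the Actor's.
- An Actor learning rate that is too high makes the policy oscillate.
- A Critic learning rate that is too low makes the feedback signal lag behind.

3.3 Noise annealing

A smart director reduces improvisation as rehearsals progress:

```python
def update_noise(noise_scale):
    noise_scale *= 0.9999  # exponential decay
    return max(noise_scale, 0.1)  # keep a minimum level of exploration
```

This annealing schedule strikes a balance:

- High noise early on encourages exploration.
- Low noise later on helps fine-tune the policy.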
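A quick back-of-the-envelope check shows how long this schedule takes to reach its floor. Assuming the noise starts at 1.0 (a hypothetical value, since the article does not give `initial_noise`) and `update_noise` is called once per training step, the floor of 0.1 is reached after about ln(0.1) / ln(0.9999) steps. The sketch below computes that closed form and cross-checks it by running the schedule; the `decay` and `floor` parameters are added here only for illustration.

```python
import math


def update_noise(noise_scale, decay=0.9999, floor=0.1):
    """Same schedule as above, with the constants exposed as parameters."""
    return max(noise_scale * decay, floor)


initial_noise = 1.0  # assumed starting scale, not given in the article

# Closed form: the smallest n with initial_noise * decay**n <= floor.
steps_to_floor = math.ceil(math.log(0.1 / initial_noise) / math.log(0.9999))

# Cross-check by actually running the schedule.
noise_scale, steps = initial_noise, 0
while noise_scale > 0.1:
    noise_scale = update_noise(noise_scale)
    steps += 1

print(steps_to_floor, steps)
```

Under these assumptions the floor is reached after roughly 23,000 steps, or about 115 Pendulum-v1 episodes of 200 steps each. If that is too slow or too fast for a given task, the decay factor is the knob to turn.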
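4. The full performance: a Pendulum-v1 walkthrough

Let's look at a concrete pendulum-balancing example. Below is the core of the training loop. Here compute_critic_loss and compute_actor_loss are assumed to wrap the snippets from sections 2.2 and 2.3.

```python
# Older Gym API: reset() returns the observation and step() returns 4 values.
# With Gymnasium, reset() returns (obs, info) and step() returns 5 values.

# Initialize once, outside the episode loop, so that annealing carries over
# between episodes instead of being reset every time.
noise_scale = initial_noise

for episode in range(total_episodes):
    state = env.reset()
    episode_reward = 0

    for step in range(max_steps):
        action = select_action(state, noise_scale * np.random.randn(action_dim))
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        state = next_state
        episode_reward += reward

        if len(replay_buffer) > batch_size:
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

            # Convert to PyTorch tensors; rewards and dones become (batch, 1)
            states = torch.FloatTensor(states)
            actions = torch.FloatTensor(actions)
            rewards = torch.FloatTensor(rewards).unsqueeze(1)
            next_states = torch.FloatTensor(next_states)
            dones = torch.FloatTensor(dones).unsqueeze(1)

            # Critic update
            critic_optimizer.zero_grad()
            critic_loss = compute_critic_loss(states, actions, rewards, next_states, dones)
            critic_loss.backward()
            critic_optimizer.step()

            # Actor update
            actor_optimizer.zero_grad()
            actor_loss = compute_actor_loss(states)
            actor_loss.backward()
            actor_optimizer.step()

            # Soft-update the target networks
            soft_update(target_actor, actor, tau)
            soft_update(target_critic, critic, tau)

            noise_scale = update_noise(noise_scale)

        if done:
            break

    print(f"Episode {episode}, Reward: {episode_reward}")
```

Common symptoms observed during training:

| Training stage | Typical symptom | What to try |
| --- | --- | --- |
| Early (0-1k steps) | Reward fluctuates randomly | Increase the noise scale; enlarge the replay buffer |
| Middle (1k-10k steps) | Occasional high scores, but unstable | Check whether the Critic loss is converging; adjust the learning rate |
| Late (10k+ steps) | Performance plateau | Try reducing the noise; fine-tune the network architecture |

In the Pendulum-v1 environment, a successful run usually starts to show a stable swing-up policy after about 50-100 episodes and reaches near-optimal performance at around 300 episodes.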
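One detail in the training loop that is easy to overlook is the `unsqueeze(1)` on `rewards` and `dones`. The Critic returns a tensor of shape `(batch, 1)`, while the sampled rewards have shape `(batch,)`. Adding the two without reshaping silently broadcasts to `(batch, batch)`. A small sketch with random stand-in tensors (no networks or environment involved) makes the difference visible:

```python
import torch

batch_size, gamma = 4, 0.99

rewards = torch.randn(batch_size)             # shape (4,), as sampled
dones = torch.zeros(batch_size)               # shape (4,)
target_q_values = torch.randn(batch_size, 1)  # shape (4, 1), like Critic output

# Without unsqueeze: (4,) + (4,) * (4, 1) broadcasts to (4, 4).
wrong_targets = rewards + (1 - dones) * gamma * target_q_values

# With unsqueeze(1): every term is (4, 1), so the result is (4, 1).
right_targets = (
    rewards.unsqueeze(1) + (1 - dones.unsqueeze(1)) * gamma * target_q_values
)

print(wrong_targets.shape, right_targets.shape)
```

A loss computed against the wrongly shaped targets still produces a number (PyTorch typically only emits a shape-mismatch warning), so the Critic would train on meaningless targets without crashing. Checking tensor shapes once before a long run is a cheap safeguard.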