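Reinforcement Learning Policy Parameter Tuning and Value Iteration Implementation: CS188 Proj3 Study Notes

Q1. Value Iteration

The first question is the most basic one: implementing value iteration. There is nothing hard about it; it mostly comes down to typing out the code while reading the formula. It helps to review the Value Iteration framework in Note8 first. The only thing to watch out for is that the batch version is required, not the online version. This concept was not covered in the earlier Notes, and pictures make it easier to understand.

The data structure used in this question is Counter, whose underlying container is a hash table. A Counter is very similar to a dictionary; it only adds one rule, namely that every key starts with the value 0. Each grid in the figure above can be viewed as one Counter: every state has a corresponding value, just like a key-value pair.

The online version means that, while updating one state during an iteration, you use the values of other states that were already updated in that same iteration. Take Figure 1 as a reference: suppose the first state visited is the one with utility 1; then the square to its left would also be affected by that utility of 1 during the first round. That amounts to peeking at the new values of the current round. The step from Figure 1 to Figure 2 is the batch version: Figure 2 is derived entirely from Figure 1, not from any new values of the second iteration. This strictly follows the rule that $V_k$ is derived from $V_{k-1}$.

Code implementation

    def runValueIteration(self):
        """
          Run the value iteration algorithm. Note that in standard
          value iteration, V_k+1(...) depends on V_k(...)'s.
        """
        "*** YOUR CODE HERE ***"
        for i in range(self.iterations):
            # Batch update: write into a fresh Counter so that every
            # backup in this sweep reads only the previous iteration's values.
            newValues = util.Counter()
            for state in self.mdp.getStates():
                if self.mdp.isTerminal(state):
                    newValues[state] = 0
                    continue
                actions = self.mdp.getPossibleActions(state)
                if not actions:
                    newValues[state] = 0
                    continue
                qValues = []
                for action in actions:
                    q = self.computeQValueFromValues(state, action)
                    qValues.append(q)
                newValues[state] = max(qValues)
            # Swap in the new table only after the whole sweep is done.
            self.values = newValues

    def getValue(self, state):
        """
          Return the value of the state (computed in __init__).
        """
        return self.values[state]

    def computeQValueFromValues(self, state, action):
        """
          Compute the Q-value of action in state from the
          value function stored in self.values.
        """
        "*** YOUR CODE HERE ***"
        qValue = 0
        transitions = self.mdp.getTransitionStatesAndProbs(state, action)
        for nextState, prob in transitions:
            reward = self.mdp.getReward(state, action, nextState)
            # Expected value of (reward + discounted value of the successor).
            qValue += prob * (reward + self.discount * self.getValue(nextState))
        return qValue

    def computeActionFromValues(self, state):
        """
          The policy is the best action in the given state
          according to the values currently stored in self.values.

          You may break ties any way you see fit.  Note that if
          there are no legal actions, which is the case at the
          terminal state, you should return None.
        """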
        "*** YOUR CODE HERE ***"
        if self.mdp.isTerminal(state):
            return None
        actions = self.mdp.getPossibleActions(state)
        if not actions:
            return None
        bestAction = None
        bestValue = float('-inf')
        for action in actions:
            q = self.computeQValueFromValues(state, action)
            if q > bestValue:
                bestValue = q
                bestAction = action
        return bestAction

The overall idea is not hard. While coding, just make sure not to forget the case where there is no legal action.

Q2. Policies

Q2 is even simpler: it is a matter of tuning parameters by intuition. There are three variables to keep in mind.

Discount: if immediate rewards matter more, use a smaller discount factor so that distant rewards are discounted more heavily in value iteration; if long-term rewards matter, use a larger discount factor so that they are not weakened too much.

Noise: if the agent should walk along the cliff, turn the noise down so that unintended moves do not happen; if it should take the long way around, turn the noise up so that walking next to the cliff becomes too risky.

LivingReward: if the goal is to never exit, increase the living reward; otherwise decrease it.

Code implementation

    def question2a():
        """
          Prefer the close exit (+1), risking the cliff (-10).
        """
        answerDiscount = 0.3
        answerNoise = 0.0
        answerLivingReward = 0.0
        return answerDiscount, answerNoise, answerLivingReward
        # If not possible, return 'NOT POSSIBLE'

    def question2b():
        """
          Prefer the close exit (+1), but avoiding the cliff (-10).
        """
        answerDiscount = 0.3
        answerNoise = 0.2
        answerLivingReward = 0.0
        return answerDiscount, answerNoise, answerLivingReward
        # If not possible, return 'NOT POSSIBLE'

    def question2c():
        """
          Prefer the distant exit (+10), risking the cliff (-10).
        """
        answerDiscount = 0.8
        answerNoise = 0.0
        answerLivingReward = 0.0
        return answerDiscount, answerNoise, answerLivingReward
        # If not possible, return 'NOT POSSIBLE'

    def question2d():
        """
          Prefer the distant exit (+10), avoiding the cliff (-10).
        """
        answerDiscount = 0.8
        answerNoise = 0.3
        answerLivingReward = 0
        return answerDiscount, answerNoise, answerLivingReward
        # If not possible, return 'NOT POSSIBLE'

    def question2e():
        """
          Avoid both exits and the cliff (so an episode should never terminate).
        """
        answerDiscount = 0.9
        answerNoise = 0.0
        answerLivingReward = 1
        return answerDiscount, answerNoise, answerLivingReward
        # If not possible, return 'NOT POSSIBLE'

Q3. Q-Learning

Q3 is slightly more involved but still not difficult; the coding revolves around implementing Q-Learning. Five functions need to be completed. The only point that needs attention is in computeActionFromQValues: when several actions in the same state share the best Q-value, random.choice() must be used to pick among them at random, otherwise the autograder will not pass.

Code implementation

    class QLearningAgent(ReinforcementAgent):
        """
          Q-Learning Agent

          Functions you should fill in:
            - computeValueFromQValues
            - computeActionFromQValues
            - getQValue
            - getAction
            - update

          Instance variables you have access to
            - self.epsilon (exploration prob)
            - self.alpha (learning rate)
            - self.discount (discount rate)

          Functions you should use
            - self.getLegalActions(state)
              which returns legal actions for a state
        """
        def __init__(self, **args):
            "You can initialize Q-values here..."
            ReinforcementAgent.__init__(self, **args)

            "*** YOUR CODE HERE ***"
            # Counter returns 0 for unseen (state, action) keys.
            self.qValues = util.Counter()

        def getQValue(self, state, action):
            """
              Returns Q(state,action)
              Should return 0.0 if we have never seen a state
              or the Q node value otherwise
            """
            "*** YOUR CODE HERE ***"
            return self.qValues[(state, action)]

        def computeValueFromQValues(self, state):
            """
              Returns max_action Q(state,action)
              where the max is over legal actions.  Note that if
              there are no legal actions, which is the case at the
              terminal state, you should return a value of 0.0.
            """
            "*** YOUR CODE HERE ***"
            actions = self.getLegalActions(state)
            bestQvalue = float('-inf')
            if not actions:
                return 0.0
            for action in actions:
                if self.getQValue(state, action) > bestQvalue:
                    bestQvalue = self.getQValue(state, action)
            return bestQvalue
            # return max([self.getQValue(state, action) for action in actions])

        def computeActionFromQValues(self, state):
            """
              Compute the best action to take in a state.  Note that if there
              are no legal actions, which is the case at the terminal state,
              you should return None.
            """
            "*** YOUR CODE HERE ***"
            actions = self.getLegalActions(state)
            bestQvalue = self.computeValueFromQValues(state)
            if not actions:
                return None
            # Collect every action tied for the best Q-value, then break ties randomly.
            bestActions = [action for action in actions
                           if self.getQValue(state, action) == bestQvalue]
            return random.choice(bestActions)

        def getAction(self, state):
            """
              Compute the action to take in the current state.  With
              probability self.epsilon, we should take a random action and
              take the best policy action otherwise.  Note that if there are
              no legal actions, which is the case at the terminal state, you
              should choose None as the action.

              HINT: You might want to use util.flipCoin(prob)
              HINT: To pick randomly from a list, use random.choice(list)
            """
            # Pick Action
            legalActions = self.getLegalActions(state)
            "*** YOUR CODE HERE ***"
            if not legalActions:
                return None
            if util.flipCoin(self.epsilon):
                return random.choice(legalActions)
            else:
                return self.computeActionFromQValues(state)

        def update(self, state, action, nextState, reward: float):
            """
              The parent class calls this to observe a
              state = action => nextState and reward transition.
              You should do your Q-Value update here

              NOTE: You should never call this function,
              it will be called on your behalf
            """
            "*** YOUR CODE HERE ***"
            sample = reward + self.discount * self.computeValueFromQValues(nextState)
            oldQ = self.getQValue(state, action)
            # Blend the old estimate with the new sample using the learning rate.
            self.qValues[(state, action)] = (1 - self.alpha) * oldQ + self.alpha * sample

        def getPolicy(self, state):
            return self.computeActionFromQValues(state)

        def getValue(self, state):
            return self.computeValueFromQValues(state)

In computeValueFromQValues, the line marked with # is a shorthand for everything from the for loop down; the version I wrote is less readable and less concise.

In the initializer we again choose Counter as the storage structure. Because its values start at 0, the coding becomes much more convenient.

The update function revolves around the core Q-Learning formula, which is covered in detail in Note9.

Do not skip over what util.flipCoin(p) does in getAction; util.py gives the implementation shown below. It is very simple and its meaning is clear. In the conditional inside getAction, it means that the if branch is taken with probability self.epsilon and the else branch with probability 1 - self.epsilon. This is the ε-Greedy Policies idea mentioned in Note10.

    def flipCoin(p):
        r = random.random()
        return r < p

Q4. Epsilon Greedy

Q4 was already implemented as part of Q3 (I had not read the requirements carefully). It is exactly the ε-Greedy Policies mentioned above. The original project document also explains the logic of util.flipCoin(p):

You can simulate a binary variable with probability p of success by using util.flipCoin(p), which returns True with probability p and False with probability 1-p.

The original document also gives two nearly identical shell commands:

    python gridworld.py -a q -k 100 --noise 0.0 -e 0.1    # 10% random exploration, 90% follow the current best Q
    python gridworld.py -a q -k 100 --noise 0.0 -e 0.9    # 90% random exploration, only 10% follow the best Q

It is easy to see that with the first command, once the agent has found the path with the highest reward, it basically keeps repeating the same route. With the second command, the agent basically keeps moving around at random.

Q5. Q-Learning and Pacman

The code above passes the Q5 autograder directly. What is worth understanding and reviewing is that learning mediumGrid with Q-Learning does not work: the state space is huge and Q-Learning has no ability to generalize. The agent does not realize that running into a ghost is bad in general; it can only remember that running into a ghost is bad on one specific board.

Q6. Approximate Q-Learning

The Approximate Q-Learning presented in Q6 does have the ability to generalize: the agent learns from experience instead of learning which specific action to take in a specific situation. This question is not hard, and the document provides the definitions of the functions that may be needed.

Notice that the overall expression of Approximate Q-Learning, the expression of a heuristic, and the expression of an evaluation function are somewhat similar. The heuristic for the corners problem in Q6 of Proj1 is a concrete implementation of a heuristic, and Q1 (Reflex Agent) of Proj2 is also worth reviewing. Comparing the three forms shows that they share the same idea.

Approximate Q-Learning expression

\begin{align*}
Q(s,a) = \sum_{i} w_i f_i(s,a)
\end{align*}

Heuristic expression

\begin{align*}
h(s) = \sum_{i} w_i f_i(s)
\end{align*}

Evaluation function expression

\begin{align*}
\operatorname{Eval}(s) = \sum_{i} w_i f_i(s)
\end{align*}

Code implementation

    class ApproximateQAgent(PacmanQAgent):
        """
           ApproximateQLearningAgent

           You should only have to overwrite getQValue
           and update.  All other QLearningAgent functions
           should work as is.
        """
        def __init__(self, extractor='IdentityExtractor', **args):
            self.featExtractor = util.lookup(extractor, globals())()
            PacmanQAgent.__init__(self, **args)
            self.weights = util.Counter()

        def getWeights(self):
            return self.weights

        def getQValue(self, state, action):
            """
              Should return Q(state,action) = w * featureVector
              where * is the dotProduct operator
            """
            "*** YOUR CODE HERE ***"
            features = self.featExtractor.getFeatures(state, action)
            qValue = 0.0
            for f in features:
                qValue += self.weights[f] * features[f]
            return qValue

        def update(self, state, action, nextState, reward: float):
            """
               Should update your weights based on transition
            """
            "*** YOUR CODE HERE ***"
            features = self.featExtractor.getFeatures(state, action)
            currentQ = self.getQValue(state, action)
            nextValue = self.computeValueFromQValues(nextState)
            # Compute the difference once, before any weight changes.
            difference = (reward + self.discount * nextValue) - currentQ
            for f in features:
                self.weights[f] += self.alpha * difference * features[f]

        def final(self, state):
            "Called at the end of each game."
            # call the super-class final method
            PacmanQAgent.final(self, state)

            # did we finish training?
            if self.episodesSoFar == self.numTraining:
                # you might want to print your weights here for debugging
                "*** YOUR CODE HERE ***"
                pass

The getFeatures function of the FeatureExtractor class is defined as follows:

    class FeatureExtractor:
        def getFeatures(self, state, action):
            """
              Returns a dict from features to counts
              Usually, the count will just be
              1.0 for indicator functions.
            """
            util.raiseNotDefined()

The overall implementation has no real difficulty; it is only a matter of reproducing the formulas in code.
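To make the batch versus online distinction from Q1 concrete, here is a minimal sketch on a made-up three-state chain. The states `A`, `B`, `T`, the reward table, and the helper names are all assumptions for illustration and are not part of the CS188 code base.

```python
# Hypothetical chain A -> B -> T with one action per state.
# Entering the terminal state T pays +1; everything else pays 0.
GAMMA = 0.9
NEXT = {"B": "T", "A": "B"}  # iteration order: B first, then A
REWARD = {("B", "T"): 1.0, ("A", "B"): 0.0}


def backup(state, values):
    """One Bellman backup for a state with a single deterministic action."""
    nxt = NEXT[state]
    return REWARD[(state, nxt)] + GAMMA * values[nxt]


def batch_sweep(values):
    """Batch version: every backup reads the old table V_{k-1}."""
    new_values = dict(values)
    for s in NEXT:
        new_values[s] = backup(s, values)
    return new_values


def in_place_sweep(values):
    """Online version: backups see values already updated in this sweep."""
    values = dict(values)
    for s in NEXT:
        values[s] = backup(s, values)
    return values


v0 = {"A": 0.0, "B": 0.0, "T": 0.0}
batch_v1 = batch_sweep(v0)
in_place_v1 = in_place_sweep(v0)
print(batch_v1)     # A is still 0.0 after one batch sweep
print(in_place_v1)  # A already picked up B's new value
```

In the batch sweep, `A` stays at 0 after one iteration because it only sees the old value of `B`; in the in-place sweep, `A` has already "peeked" at the new value of `B`.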
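The Q-Learning update and the ε-greedy choice from Q3 and Q4 can also be tried outside the project. The following is only a sketch under stated assumptions: it uses a `defaultdict` in place of `util.Counter`, and the function names, state labels, and action lists are hypothetical.

```python
import random
from collections import defaultdict


def flip_coin(p, rng):
    """Return True with probability p (same idea as util.flipCoin)."""
    return rng.random() < p


def epsilon_greedy(q, state, actions, epsilon, rng):
    """Pick a random action with probability epsilon, else a best action."""
    if not actions:
        return None
    if flip_coin(epsilon, rng):
        return rng.choice(actions)
    best = max(q[(state, a)] for a in actions)
    # Break ties among equally good actions at random.
    return rng.choice([a for a in actions if q[(state, a)] == best])


def q_update(q, state, action, next_state, reward, next_actions, alpha, discount):
    """Tabular update: Q <- (1 - alpha) * Q + alpha * sample."""
    next_value = max((q[(next_state, a)] for a in next_actions), default=0.0)
    sample = reward + discount * next_value
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * sample


q = defaultdict(float)  # unseen (state, action) pairs default to 0.0
rng = random.Random(0)
q_update(q, "s0", "east", "s1", 1.0, ["east", "west"], alpha=0.5, discount=0.9)
greedy_action = epsilon_greedy(q, "s0", ["east", "west"], epsilon=0.0, rng=rng)
print(q[("s0", "east")], greedy_action)
```

With `epsilon=0.0` the coin flip never succeeds, so the agent always takes the greedy action; with `epsilon=1.0` it always explores.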
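For the Approximate Q-Learning formula in Q6, a single weight update can be worked through by hand. This sketch assumes a plain dict of features and a `defaultdict` of weights; the feature names `bias` and `near-ghost` and the numbers are invented for illustration.

```python
from collections import defaultdict


def q_value(weights, features):
    """Q(s, a) as the dot product of weights and feature values."""
    return sum(weights[f] * v for f, v in features.items())


def approx_update(weights, features, reward, next_value, alpha, discount):
    """Move each weight along its feature by alpha * difference."""
    # The difference is computed once, before any weight is modified.
    difference = (reward + discount * next_value) - q_value(weights, features)
    for f, v in features.items():
        weights[f] += alpha * difference * v


weights = defaultdict(float)  # all weights start at 0.0
features = {"bias": 1.0, "near-ghost": 2.0}  # hypothetical feature vector
approx_update(weights, features, reward=-10.0, next_value=0.0, alpha=0.1, discount=0.9)
print(dict(weights), q_value(weights, features))
```

Because the penalty is spread over feature weights rather than stored per state, any other state where `near-ghost` is active gets a lower Q-value as well, which is the generalization that tabular Q-Learning lacks.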