多智能体搜索算法Python实现 CS188 Proj2 学习笔记 更好的阅读体验Q1.Reflex AgentProj2的开头是一个完成评估函数的问题评估函数的思想非常类似与启发式的思想。相较与Proj1不同的是该过程添加了对抗智能体还有ScaredTime的设定。完成评估函数的时候首先考虑就是评估函数的组成要素其中可以大致表示为总分 游戏自带分数 食物相关奖励 scared鬼奖励 - stop惩罚 - 危险鬼惩罚原评估函数在Reflex Agent类中的scoreEvaluationFunction中实现其只是返回了当前状态的分数缺少很多必要的组成元素可以用下列指令在shell中运行原评估函数的gui版本python pacman.py-pReflexAgent-ltestClassic很容易发现的是pacman在许多情况下采取了action集中的stop动作。这种行为及其拉低效率所以把当采取stop这一步的影响也要算入评估函数的表达式中代码实现defevaluationFunction(self,currentGameState:GameState,action): Design a better evaluation function here. The evaluation function takes in the current and proposed successor GameStates (pacman.py) and returns a number, where higher numbers are better. The code below extracts some useful information from the state, like the remaining food (newFood) and Pacman position after moving (newPos). newScaredTimes holds the number of moves that each ghost will remain scared because of Pacman having eaten a power pellet. Print out these variables to see what youre getting, then combine them to create a masterful evaluation function. # Useful information you can extract from a GameState (pacman.py)successorGameStatecurrentGameState.generatePacmanSuccessor(action)newPossuccessorGameState.getPacmanPosition()newFoodsuccessorGameState.getFood()newGhostStatessuccessorGameState.getGhostStates()newScaredTimes[ghostState.scaredTimerforghostStateinnewGhostStates]*** YOUR CODE HERE ***# print(Action: , action)# print(Pacmans position: , newPos)# print(Food count: , newFood.count())# print(Ghost positions: , [g.getPosition() for g in newGhostStates])# print(Scared times: , newScaredTimes)# print(- * 30)scoresuccessorGameState.getScore()ifactionDirections.STOP:score-10foodListnewFood.asList()iffoodList:minFoodDistmin(manhattanDistance(newPos,food)forfoodinfoodList)score10.0/minFoodDistforghostStateinnewGhostStates:ghostPosghostState.getPosition()distmanhattanDistance(ghostPos,newPos)ifghostState.scaredTimer0:ifdist0:score5.0/distelse:ifdist1:score-200else:score-5.0/distreturnscore在Q1原文的提示中提及到了尝试数值的倒数而不是数值本身在本阶段的代码实现也有体现Note: As features, try the reciprocal of important values (such as distance to food) rather than just the values themselves.最终结果Record: Win, Win, Win, Win, Win, Win, Win, Win, Win, Win *** PASS: test_cases/q1/grade-agent.test (4 of 4 points) *** 1278.3 average score (2 of 2 points) *** Grading scheme: *** 500: 0 points *** 500: 1 points *** 1000: 2 points *** 10 games not timed out (0 of 0 points) *** Grading scheme: *** 10: fail *** 10: 0 points *** 10 wins (2 of 2 points) *** Grading scheme: *** 1: fail *** 1: 0 points *** 5: 1 points *** 10: 2 points ### Question q1: 4/4 ### Finished at 20:20:52 Provisional grades Question q1: 4/4 ------------------ Total: 4/4在Q1测试中是可以完美拿到满分的但是我运行了shell指令更换其他的大型复杂layout,发现智能体在后期很容易出现反复踱步的问题。经过AI的详细解答后总结原因得知决策规则里有一行关键代码影响了智能体的行动决策过程chosenIndexrandom.choice(bestIndices)到后期之后食物减少的同时鬼的距离也变远两个方向的1/distance几乎一样而且ghost的距离也差不多总体来说多个action的分数会出现一样的情况。上述代码面对这种情况的时候会出现A → B → A → B的决策过程掉入Note4中提及到的局部最优震荡陷阱,每一步都是最优但整体状况上不前进。从根源上讲这和ReflexAgent天生没有长期记忆和规划能力有关系缓解这个问题的解决方式可以是当检测到新位置和当前位置重复连续出现太多次进行扣分操作Q2.Minimax首先可以回忆一下Note5中Minimax的伪代码在实现文件中代码已经提供了可能用到的函数非常全面。需要注意的是minimax树中的叶节点用self.evaluationFunction来评判状态分数代码实现classMinimaxAgent(MultiAgentSearchAgent): Your minimax agent (question 2) defgetAction(self,gameState:GameState): Returns the minimax action from the current gameState using self.depth and self.evaluationFunction. Here are some method calls that might be useful when implementing minimax. gameState.getLegalActions(agentIndex): Returns a list of legal actions for an agent agentIndex0 means Pacman, ghosts are 1 gameState.generateSuccessor(agentIndex, action): Returns the successor game state after an agent takes an action gameState.getNumAgents(): Returns the total number of agents in the game gameState.isWin(): Returns whether or not the game state is a winning state gameState.isLose(): Returns whether or not the game state is a losing state *** YOUR CODE HERE ***defminimax(state,agentIndex,depth):ifstate.isWin()orstate.isLose()ordepthself.depth:returnself.evaluationFunction(state)numAgentsstate.getNumAgents()ifagentIndex0:value-float(inf)foractioninstate.getLegalActions(0):successorstate.generateSuccessor(0,action)valuemax(value,minimax(successor,1,depth))returnvalueelse:valuefloat(inf)nextAgentIndexagentIndex1foractioninstate.getLegalActions(agentIndex):successorstate.generateSuccessor(agentIndex,action)ifnextAgentIndexnumAgents:valuemin(value,minimax(successor,0,depth1))else:valuemin(value,minimax(successor,nextAgentIndex,depth))returnvalue bestScore-float(inf)bestActionNoneforactioningameState.getLegalActions():successorgameState.generateSuccessor(0,action)scoreminimax(successor,1,0)ifscorebestScore:bestScorescore bestActionactionreturnbestAction这里一定要注意depth的逻辑一次搜索是一次pacman和所有ghost的反应即depth为2时涉及pacman和每个ghost移动两次这里我想了一下代码具体运行过程才能想明白假设self.depth是1,也就是pacman和ghost的移动次数都为1。最开始函数调用传入的minimax(successor, 1, 0),到运行到最后一个ghost智能体时则会执行value min(value, minimax(successor, 0, depth 1))再运行则会触发终止条件depth self.depth,返回状态的分数return self.evaluationFunction()结束递归嵌套,所以说最终depth还是为1的Q3.Alpha-Beta Pruning依旧可以回忆一下Note5中Alpha-Beta Pruning的伪代码有个小细节需要注意的是在Proj2的文档中函数的判断条件稍微改了一下如果不按下面的伪代码实现的话最后autograder没法通过6-tied-root.test*** FAIL: test_cases/q3/6-tied-root.test *** Incorrect generated nodes for depth3 *** Student generated nodes: A B max min1 min2 *** Correct generated nodes: A B C max min1 min2 *** Tree: *** max *** / \ *** min1 min2 *** | / \ *** A B C *** 10 10 0代码实现classAlphaBetaAgent(MultiAgentSearchAgent): Your minimax agent with alpha-beta pruning (question 3) defgetAction(self,gameState:GameState): Returns the minimax action using self.depth and self.evaluationFunction *** YOUR CODE HERE ***defalphabeta(state,alpha,beta,depth,agentIndex):ifstate.isWin()orstate.isLose()ordepthself.depth:returnself.evaluationFunction(state)numAgentsstate.getNumAgents()ifagentIndex0:value-float(inf)foractioninstate.getLegalActions(0):successorstate.generateSuccessor(0,action)valuemax(value,alphabeta(successor,alpha,beta,depth,1))ifvaluebeta:returnvalue alphamax(alpha,value)returnvalueelse:valuefloat(inf)nextAgentIndexagentIndex1foractioninstate.getLegalActions(agentIndex):successorstate.generateSuccessor(agentIndex,action)ifnextAgentIndexnumAgents:valuemin(value,alphabeta(successor,alpha,beta,depth1,0))else:valuemin(value,alphabeta(successor,alpha,beta,depth,nextAgentIndex))ifvaluealpha:returnvalue betamin(beta,value)returnvalue bestScore-float(inf)bestActionNonealpha-float(inf)betafloat(inf)foractioningameState.getLegalActions(0):successorgameState.generateSuccessor(0,action)scorealphabeta(successor,alpha,beta,0,1)ifscorebestScore:bestScorescore bestActionaction alphamax(alpha,bestScore)returnbestAction代码实现的整体框架和Minimax基本没有差异唯一需要注意的点就是根节点作为最大化节点也是需要剪枝的所以说一定不能忘了倒数第二行的alpha max(alpha, bestScore)这里还需加强理解一下alpha和beta在各个阶段的具体数值才能深刻理解alpha-beta pruning的运行过程拿下图来举个例子!最开始根节点值为None,alpha -float(inf)且beta float(inf)走最左子树min1开始遍历自己的一个子节点(3)遍历过后更新min1的beta 3之后遍历后两个子节点(12和6)得到min1的值已知min1的值为3,根节点的alpha开始更新执行alpha max(alpha, bestScore)语句此刻根节点的alpha 3遍历中间子树根节点alpha的值传了下来当遍历到中间子树的第一个子节点(2)触发判断语句value alpha就直接退出此次遍历了Q4.Expectimax依旧依旧是可以回顾一下Note6中的伪代码代码实现classExpectimaxAgent(MultiAgentSearchAgent): Your expectimax agent (question 4) defgetAction(self,gameState:GameState): Returns the expectimax action using self.depth and self.evaluationFunction All ghosts should be modeled as choosing uniformly at random from their legal moves. *** YOUR CODE HERE ***defexpectimax(state,agentIndex,depth):ifstate.isWin()orstate.isLose()ordepthself.depth:returnself.evaluationFunction(state)agentNumstate.getNumAgents()ifagentIndex0:value-float(inf)foractioninstate.getLegalActions(0):successorstate.generateSuccessor(0,action)valuemax(value,expectimax(successor,1,depth))else:value0actionsstate.getLegalActions(agentIndex)nextAgentIndexagentIndex1foractioninstate.getLegalActions(agentIndex):successorstate.generateSuccessor(agentIndex,action)probability1/len(actions)ifnextAgentIndexagentNum:valueprobability*expectimax(successor,0,depth1)else:valueprobability*expectimax(successor,nextAgentIndex,depth)returnvalue bestActionNonebestScore-float(inf)foractioningameState.getLegalActions(0):successorgameState.generateSuccessor(0,action)scoreexpectimax(successor,1,0)ifscorebestScore:bestScorescore bestActionactionreturnbestAction util.raiseNotDefined()Q4这个问题和前面几个问题代码基本上都是一个模板直接套就行唯一需要注意的地方是有关probability的代码对手将在getLegalActions中纯随机选择,所以probability的实现是probability 1 / len(actions),原文档如下To simplify your code, assume you will only be running against an adversary which chooses amongst their getLegalActions uniformly at random.原文档还提到了AlphaBetaAgent(以Minimax为基础)和ExpectimaxAgent的运行时候的区别python pacman.py-pAlphaBetaAgent-ltrappedClassic-adepth3-q-n10python pacman.py-pExpectimaxAgent-ltrappedClassic-adepth3-q-n10比较容易观察发现的现象是ExpectimaxAgent可以在多局比赛中赢得几乎一半的数量AlphaBetaAgent总是输。导致这个现象的根本原因就是在Note6开头中提及到的Minimax太过悲观,AlphaBetaAgent总是假设对手会采取最优行动所以pacman会直接放弃争取机会。而Expectimax中认定ghost是随机行动在计算期望值后会在关键时刻决定赌一把Q5.Evaluation Function和Q1问题相同都是完成Evaluation Function函数传入的参数里面没有action,所以没有stop惩罚了感觉没有了灵魂。我的整体思路是总分 游戏自带分数 最近食物距离 - 食物数量(长期目标) - 普通鬼惩罚 可吃鬼奖励 - 胶囊距离惩罚代码实现defbetterEvaluationFunction(currentGameState:GameState): Your extreme ghost-hunting, pellet-nabbing, food-gobbling, unstoppable evaluation function (question 5). DESCRIPTION: write something here so we know what you did *** YOUR CODE HERE ***poscurrentGameState.getPacmanPosition()foodcurrentGameState.getFood()ghostStatescurrentGameState.getGhostStates()scaredTimes[ghostState.scaredTimerforghostStateinghostStates]capsulescurrentGameState.getCapsules()scorecurrentGameState.getScore()foodListfood.asList()iffoodList:distToClosestFoodmin(manhattanDistance(pos,food)forfoodinfoodList)score20.0/distToClosestFoodelse:score1000score-5*len(foodList)ifcapsules:distToClosestCapsulemin(manhattanDistance(pos,capsule)forcapsuleincapsules)score5.0/distToClosestCapsule score-20*len(capsules)forghostinghostStates:ghostPosghost.getPosition()ghostDistmanhattanDistance(pos,ghostPos)ifghost.scaredTimer0:score100.0/ghostDistelse:ifghostDist1:score-500elifghostDist3:score-200elifghostDist5:score-100returnscore基本思路和Q1是一样的没有了Stop的惩罚让pacman在执行动作过程中会出现特别特别多的无厘头停止动作启发式内容仍需要优化但是可以pass autograder