We investigate multi-agent reinforcement learning for stochastic games with complex tasks, where the reward functions are non-Markovian. We use reward machines to encode high-level knowledge of these tasks, and we develop an algorithm, Q-learning with reward machines for stochastic games (QRM-SG), that learns the best-response strategy at a Nash equilibrium for each agent. In QRM-SG, the Q-function at a Nash equilibrium is defined over an augmented state space that combines the state of the stochastic game with the states of the reward machines. Each agent learns the Q-functions of all agents in the system. We prove that the Q-functions learned in QRM-SG converge to the Q-functions at a Nash equilibrium if the stage game at each time step during learning has a global optimum point or a saddle point, and the agents update their Q-functions based on the best-response strategy at this point. We use the Lemke-Howson method to derive the best-response strategy given the current Q-functions. Three case studies show that QRM-SG learns the best-response strategies effectively: after around 7500 episodes in Case Study I, 1000 episodes in Case Study II, and 1500 episodes in Case Study III, whereas the baseline methods, including Nash Q-learning and MADDPG, fail to converge to a Nash equilibrium in all three case studies.
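To make the augmented state space concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical two-step task ("observe label `a`, then label `b`"), a dictionary encoding of the reward machine, and a temporal-difference update of the Nash Q-learning form; the abstract does not give the exact update rule. The names `RewardMachine`, `rewards_along`, `q_update`, and `TASK` are illustrative only. In QRM-SG the value of the next stage game would come from an equilibrium computed with the Lemke-Howson method; here it is simply passed in as `next_value`.

```python
class RewardMachine:
    """Finite-state machine mapping (rm_state, label) to (next_rm_state, reward).

    Hypothetical encoding: transitions = {(u, label): (u_next, reward)}.
    """

    def __init__(self, initial, transitions):
        self.initial = initial
        self.transitions = transitions

    def step(self, u, label):
        # Labels with no listed transition leave the machine state unchanged
        # and yield zero reward.
        return self.transitions.get((u, label), (u, 0.0))


def rewards_along(rm, labels):
    """Run the reward machine over a label sequence; return final state and rewards."""
    u, rewards = rm.initial, []
    for label in labels:
        u, r = rm.step(u, label)
        rewards.append(r)
    return u, rewards


def q_update(Q, aug_state, joint_action, reward, next_value, alpha=0.1, gamma=0.9):
    """One TD update for one agent's Q-table over augmented states.

    aug_state is a pair (game_state, rm_state); next_value stands in for that
    agent's equilibrium value of the next stage game (assumed given here).
    """
    key = (aug_state, joint_action)
    old = Q.get(key, 0.0)
    Q[key] = (1 - alpha) * old + alpha * (reward + gamma * next_value)
    return Q[key]


# Assumed task: reward 1 for seeing "b" only after "a" has been seen.
TASK = RewardMachine(0, {(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)})
```

The same label `b` pays differently depending on the reward-machine state, which is what makes the reward non-Markovian in the game state alone and Markovian in the pair `(game_state, rm_state)`. For example, `rewards_along(TASK, ["b", "a", "b"])` pays only on the final `b`.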