We investigate multi-agent reinforcement learning for stochastic games with complex tasks, where the reward functions are non-Markovian. We use reward machines to incorporate high-level knowledge of these tasks. We develop an algorithm called Q-learning with reward machines for stochastic games (QRM-SG) to learn the best-response strategy at a Nash equilibrium for each agent. In QRM-SG, we define the Q-function at a Nash equilibrium in an augmented state space that integrates the state of the stochastic game with the states of the reward machines. Each agent learns the Q-functions of all agents in the system. We prove that the Q-functions learned in QRM-SG converge to the Q-functions at a Nash equilibrium if the stage game at each time step during learning has a global optimum point or a saddle point, and the agents update their Q-functions based on the best-response strategy at this point. We use the Lemke-Howson method to derive the best-response strategy given the current Q-functions. Three case studies show that QRM-SG learns the best-response strategies effectively: it does so after around 7500 episodes in Case Study I, 1000 episodes in Case Study II, and 1500 episodes in Case Study III, while baseline methods such as Nash Q-learning and MADDPG fail to converge to the Nash equilibrium in all three case studies.
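To make the augmented state space concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a single hypothetical task ("observe event a, then event b") encoded as a reward machine, and a Q-table indexed by the game state, the reward machine state, and the joint action. The names `RewardMachine`, `q_update`, and `nash_value` are illustrative assumptions; in particular, `nash_value` stands in for the agent's payoff at the equilibrium of the next stage game, which the abstract says is obtained with the Lemke-Howson method and which is not computed here.

```python
from collections import defaultdict


class RewardMachine:
    """Finite-state machine that maps high-level event labels to rewards."""

    def __init__(self, initial_state, transitions):
        # transitions: {(rm_state, label): (next_rm_state, reward)}
        self.initial_state = initial_state
        self.transitions = transitions

    def step(self, rm_state, label):
        # Assumption: unlisted (state, label) pairs are self-loops with zero reward.
        return self.transitions.get((rm_state, label), (rm_state, 0.0))


# Hypothetical task: observe "a" first, then "b"; reward only on completion.
rm = RewardMachine(
    initial_state=0,
    transitions={(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)},
)


def q_update(q, s, u, joint_action, reward, nash_value, alpha=0.1, gamma=0.9):
    """One Nash-Q-style update on the augmented state (s, u).

    nash_value is assumed to be this agent's equilibrium payoff in the
    stage game at the next augmented state, supplied by an external solver.
    """
    key = (s, u, joint_action)
    q[key] = (1 - alpha) * q[key] + alpha * (reward + gamma * nash_value)
    return q[key]


# The same label "b" yields different rewards depending on the machine state,
# which is what makes the reward non-Markovian in the game state alone.
early_b = rm.step(0, "b")   # "b" before "a": no progress, no reward
after_a = rm.step(0, "a")   # "a" advances the machine
late_b = rm.step(1, "b")    # "b" after "a": task complete, reward 1.0

q = defaultdict(float)
updated = q_update(q, s="cell_3", u=1, joint_action=("up", "left"),
                   reward=late_b[1], nash_value=0.0)
```

Because the reward depends only on the pair (game state, reward machine state) and the observed label, a standard tabular update applies on the augmented space even though the reward is history-dependent in the original game.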