We investigate multi-agent reinforcement learning for stochastic games with complex tasks whose reward functions are non-Markovian. We use reward machines to encode high-level knowledge of these tasks, and we develop an algorithm, Q-learning with reward machines for stochastic games (QRM-SG), that learns each agent's best-response strategy at a Nash equilibrium. QRM-SG defines the Q-function at a Nash equilibrium in an augmented state space that combines the state of the stochastic game with the state of the reward machines. Each agent learns the Q-functions of all agents in the system. We prove that the Q-functions learned in QRM-SG converge to the Q-functions at a Nash equilibrium if the stage game at each time step during learning has a global optimum point or a saddle point and the agents update their Q-functions based on the best-response strategy at this point. We use the Lemke-Howson method to derive the best-response strategy from the current Q-functions. Three case studies show that QRM-SG learns the best-response strategies effectively: it does so after around 7500 episodes in Case Study I, 1000 episodes in Case Study II, and 1500 episodes in Case Study III, whereas baseline methods such as Nash Q-learning and MADDPG fail to converge to the Nash equilibrium in all three case studies.
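To make the augmented state space concrete, the following minimal Python sketch shows how a reward machine can turn a history-dependent (non-Markovian) reward into one that depends only on the pair of game state and reward-machine state. It is an illustration under stated assumptions, not the paper's implementation: the class `RewardMachine`, the two-step task "observe `a`, then `b`", the helpers `run` and `q_update`, and the Nash-Q-style update form are all hypothetical. The equilibrium value of the next stage game is passed in as a number, standing in for the value that a solver such as Lemke-Howson would supply.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RewardMachine:
    """Finite-state machine that emits a reward on each labelled transition."""
    initial: int
    delta: dict  # (rm_state, label) -> (next_rm_state, reward)

    def step(self, u, label):
        # Assumption: unlisted (state, label) pairs self-loop with zero reward.
        return self.delta.get((u, label), (u, 0.0))


# Hypothetical task: observe label "a", then label "b"; reward 1 on completion.
rm = RewardMachine(initial=0, delta={(0, "a"): (1, 0.0), (1, "b"): (2, 1.0)})


def run(machine, labels):
    """Feed a label sequence through the machine; return final state and rewards."""
    u, rewards = machine.initial, []
    for label in labels:
        u, r = machine.step(u, label)
        rewards.append(r)
    return u, rewards


def q_update(Q, s, u, joint_action, reward, nash_value_next, alpha=0.1, gamma=0.9):
    """One Nash-Q-style update for a single agent on the augmented state (s, u).

    nash_value_next is the agent's equilibrium payoff in the stage game at the
    next augmented state; computing it is outside the scope of this sketch.
    """
    key = (s, u, joint_action)
    Q[key] = (1 - alpha) * Q[key] + alpha * (reward + gamma * nash_value_next)
    return Q[key]


# The same label "b" is rewarded only if "a" was seen earlier.
final_state, rewards = run(rm, ["b", "a", "b"])

# Q-table keyed by (game state, reward-machine state, joint action).
Q = defaultdict(float)
new_q = q_update(Q, "s0", 1, ("up", "left"), reward=1.0,
                 nash_value_next=0.0, alpha=0.5)
```

Here the reward for `b` depends on whether `a` occurred before it, which no function of the current game state alone can express; indexing the Q-table by the reward-machine state `u` as well restores the Markov property that a Q-learning update relies on.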