To close the sim-to-real gap in reinforcement learning (RL), learned policies must remain robust to environmental uncertainties. Robust RL has been widely studied in the single-agent setting, but it remains understudied in multi-agent environments, even though strategic interactions often exacerbate the problems posed by environmental uncertainty. This work focuses on learning in distributionally robust Markov games (RMGs), a robust variant of standard Markov games in which each agent aims to learn a policy that maximizes its own worst-case performance when the deployed environment deviates within that agent's prescribed uncertainty set. The result is a set of robust equilibrium strategies for all agents that aligns with classic notions of game-theoretic equilibria. Assuming non-adaptive sampling from a generative model, we propose a sample-efficient model-based algorithm (DRNVI) with finite-sample complexity guarantees for learning robust variants of various notions of game-theoretic equilibria. We also establish an information-theoretic lower bound for solving RMGs, which confirms that the sample complexity of DRNVI is near-optimal with respect to problem-dependent factors such as the size of the state space, the target accuracy, and the horizon length.
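For concreteness, the robust objective described in the abstract can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the text: a finite horizon $H$, a nominal transition kernel $P^0$, an uncertainty set $\mathcal{U}_i(P^0)$ for agent $i$, a reward function $r_{i,h}$, joint actions $\boldsymbol{a}_h$, and a joint policy $\pi = (\pi_i, \pi_{-i})$.

\[
V_{i}^{\pi}(s) \;=\; \inf_{P \in \mathcal{U}_i(P^0)} \mathbb{E}_{\pi, P}\!\left[\sum_{h=1}^{H} r_{i,h}(s_h, \boldsymbol{a}_h) \,\middle|\, s_1 = s\right],
\qquad
V_{i}^{\pi}(s) \;\ge\; V_{i}^{(\pi_i', \pi_{-i})}(s) \quad \text{for all } i,\ s,\ \pi_i'.
\]

Under this reading, the first expression is agent $i$'s worst-case value over its own uncertainty set, and the second is the condition for $\pi$ to be a robust Nash equilibrium. Robust variants of other equilibrium notions would change the class of deviations $\pi_i'$ considered.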