To bridge the sim-to-real gap in reinforcement learning (RL), learned policies must remain robust to environmental uncertainties. Robust RL has been widely studied in the single-agent regime, but it remains understudied in multi-agent environments, even though strategic interactions often exacerbate the problems posed by environmental uncertainty. This work focuses on learning in distributionally robust Markov games (RMGs), a robust variant of standard Markov games in which each agent aims to learn a policy that maximizes its own worst-case performance when the deployed environment deviates within that agent's own prescribed uncertainty set. The result is a set of robust equilibrium strategies for all agents that aligns with classic notions of game-theoretic equilibria. Assuming a non-adaptive sampling mechanism from a generative model, we propose a sample-efficient model-based algorithm (DRNVI) with finite-sample complexity guarantees for learning robust variants of various notions of game-theoretic equilibria. We also establish an information-theoretic lower bound for solving RMGs, which confirms the near-optimal sample complexity of DRNVI with respect to problem-dependent factors such as the size of the state space, the target accuracy, and the horizon length.
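As a rough illustration of the non-adaptive sampling mechanism mentioned above, the sketch below draws a fixed number of next-state samples from a generative model for every state and joint action, then forms an empirical nominal transition kernel of the kind a model-based method could plan on. It is not DRNVI itself and covers only the data-collection step. The names `estimate_nominal_kernel` and `make_toy_sampler`, the toy dynamics, and the stationary (time-independent) kernel are all assumptions made for illustration.

```python
import itertools
import random
from collections import Counter


def estimate_nominal_kernel(num_states, action_counts, sampler, n_samples):
    """Estimate an empirical transition kernel with non-adaptive sampling.

    Every (state, joint action) pair receives the same budget of n_samples
    draws, fixed in advance and independent of what earlier draws returned.
    `sampler(s, a)` is a hypothetical generative model returning a next state.
    """
    # Joint actions are tuples with one entry per agent.
    joint_actions = list(itertools.product(*(range(k) for k in action_counts)))
    kernel = {}
    for s in range(num_states):
        for a in joint_actions:
            counts = Counter(sampler(s, a) for _ in range(n_samples))
            # Empirical frequency of each next state.
            kernel[(s, a)] = [counts[t] / n_samples for t in range(num_states)]
    return kernel


def make_toy_sampler(rng, num_states):
    """Toy generative model (an assumption, not taken from the paper)."""
    def sample(s, a):
        # Mostly deterministic drift, with occasional uniform noise.
        if rng.random() < 0.8:
            return (s + sum(a)) % num_states
        return rng.randrange(num_states)
    return sample


rng = random.Random(0)
kernel = estimate_nominal_kernel(
    num_states=3,
    action_counts=(2, 2),  # two agents, two actions each
    sampler=make_toy_sampler(rng, 3),
    n_samples=200,
)
```

In this sketch the total number of generative-model calls is `num_states * prod(action_counts) * n_samples`, so the per-pair budget `n_samples` is the quantity a sample-complexity bound would constrain. A robust planner would then optimize against an uncertainty set centered at each estimated row rather than against the row itself.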