We study offline multi-agent reinforcement learning (RL) in Markov games, where the goal is to learn an approximate equilibrium, such as a Nash equilibrium or a (coarse) correlated equilibrium, from an offline dataset pre-collected from the game. Existing works consider relatively restricted tabular or linear models and handle each equilibrium separately. In this work, we provide the first framework for sample-efficient offline learning in Markov games under general function approximation, handling all three equilibria in a unified manner. Using Bellman-consistent pessimism, we obtain interval estimates of policies' returns, and use both the upper and the lower bounds to obtain a relaxation of the gap of a candidate policy, which becomes our optimization objective. Our results generalize prior works and provide several additional insights. Importantly, we require a data coverage condition that improves over the recently proposed "unilateral concentrability". Our condition allows selective coverage of deviation policies that optimally trade off between their greediness (as approximate best responses) and coverage, and we show scenarios where this leads to significantly better guarantees. As a new connection, we also show how our algorithmic framework can subsume seemingly different solution concepts designed for the special case of two-player zero-sum games.
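The interval-estimation idea described above can be sketched as a short derivation. The notation is assumed for illustration and does not appear in the abstract: $V_i^{\pi}$ denotes player $i$'s true return under joint policy $\pi$, $\underline{V}_i$ and $\overline{V}_i$ are lower and upper estimates assumed to bracket it, and $\pi_i' \times \pi_{-i}$ is the joint policy in which player $i$ unilaterally deviates to $\pi_i'$ (the Nash-style gap is shown; the other equilibria would use different deviation classes).

```latex
% Assumption: the interval estimates bracket the true return
% for every joint policy under consideration.
\underline{V}_i^{\pi} \;\le\; V_i^{\pi} \;\le\; \overline{V}_i^{\pi}

% Gap of a candidate policy \pi: the most any single player can gain by deviating.
\mathrm{Gap}(\pi)
  \;=\; \max_i \max_{\pi_i'} \Bigl( V_i^{\pi_i' \times \pi_{-i}} - V_i^{\pi} \Bigr)

% Relaxation: upper-bound the deviation's return, lower-bound the candidate's return.
\mathrm{Gap}(\pi)
  \;\le\; \max_i \max_{\pi_i'} \Bigl( \overline{V}_i^{\pi_i' \times \pi_{-i}} - \underline{V}_i^{\pi} \Bigr)
  \;=:\; \widehat{\mathrm{Gap}}(\pi)

% Optimization objective (sketch): pick the candidate with the smallest relaxed gap.
\hat{\pi} \;\in\; \arg\min_{\pi} \widehat{\mathrm{Gap}}(\pi)
```

Under the bracketing assumption, the inequality holds term by term, so a small relaxed gap certifies a small true gap for the returned policy.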