Learning in general-sum games often yields collectively sub-optimal results. To address this, opponent shaping (OS) methods actively guide the learning processes of other agents, which empirically improves individual and group performance in many settings. Early OS methods use higher-order derivatives to shape the learning of co-players, making them unsuitable for shaping multiple learning steps. Follow-up work, Model-free Opponent Shaping (M-FOS), addresses this limitation by reframing the OS problem as a meta-game. Unlike early OS methods, however, the M-FOS framework has little theoretical understanding behind it. Providing theoretical guarantees for M-FOS is hard because (A) there is little literature on theoretical sample complexity bounds for meta-reinforcement learning, and (B) M-FOS operates in continuous state and action spaces, which makes theoretical analysis challenging. In this work, we present R-FOS, a tabular version of M-FOS that is more suitable for theoretical analysis. R-FOS discretises the continuous meta-game MDP into a tabular MDP. Within this discretised MDP, we adapt the $R_{max}$ algorithm, most prominently used to derive PAC bounds for MDPs, as the meta-learner in the R-FOS algorithm. We derive a sample complexity bound that is exponential in the cardinality of the inner state and action spaces and in the number of agents. Our bound guarantees that, with high probability, the final policy learned by an R-FOS agent is close to the optimal policy, up to a constant factor. Finally, we investigate how R-FOS's sample complexity scales with the size of the state-action space. Our theoretical results on scaling are supported empirically in the Matching Pennies environment.
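To make the two ingredients named above more concrete, the following is a minimal sketch of a generic tabular $R_{max}$ learner together with a uniform discretisation helper. It is not the R-FOS algorithm itself: the class `RMax`, the helper `discretise`, the visit threshold `m`, the discount `gamma`, and the two-state toy MDP `toy_step` are all illustrative assumptions, and the meta-game structure of M-FOS is not modelled here.

```python
def discretise(x, n_bins):
    """Hypothetical helper: map x in [0, 1] to a bin index in {0, ..., n_bins - 1}."""
    return max(0, min(int(x * n_bins), n_bins - 1))


class RMax:
    """Generic tabular R-max sketch (assumed parameters, not the paper's exact variant)."""

    def __init__(self, n_states, n_actions, r_max, m, gamma=0.9):
        self.nS, self.nA = n_states, n_actions
        self.r_max, self.m, self.gamma = r_max, m, gamma
        self.count = [[0] * n_actions for _ in range(n_states)]
        self.r_sum = [[0.0] * n_actions for _ in range(n_states)]
        self.t_count = [[[0] * n_states for _ in range(n_actions)]
                        for _ in range(n_states)]
        # Optimistic initialisation: every pair looks maximally rewarding.
        self.q = [[r_max / (1 - gamma)] * n_actions for _ in range(n_states)]

    def known(self, s, a):
        # A state-action pair is "known" once it has been tried m times.
        return self.count[s][a] >= self.m

    def update(self, s, a, r, s_next):
        if self.known(s, a):
            return  # the empirical model of a known pair is frozen
        self.count[s][a] += 1
        self.r_sum[s][a] += r
        self.t_count[s][a][s_next] += 1
        if self.known(s, a):
            self.plan()  # re-plan only when a pair becomes known

    def plan(self, sweeps=200):
        # Value iteration on the optimistic model: unknown pairs keep the
        # maximal value r_max / (1 - gamma); known pairs use empirical estimates.
        v_opt = self.r_max / (1 - self.gamma)
        for _ in range(sweeps):
            for s in range(self.nS):
                for a in range(self.nA):
                    if not self.known(s, a):
                        self.q[s][a] = v_opt
                        continue
                    n = self.count[s][a]
                    r_hat = self.r_sum[s][a] / n
                    exp_v = sum(self.t_count[s][a][s2] / n * max(self.q[s2])
                                for s2 in range(self.nS))
                    self.q[s][a] = r_hat + self.gamma * exp_v

    def act(self, s):
        # Greedy action under the optimistic Q-values (ties go to the lowest index).
        return max(range(self.nA), key=lambda a: self.q[s][a])


def toy_step(s, a):
    """Hypothetical deterministic 2-state MDP, used only for illustration."""
    if s == 0:
        return (0.1, 0) if a == 0 else (0.0, 1)
    return (0.0, 0) if a == 0 else (1.0, 1)


def run_demo(steps=20):
    agent = RMax(n_states=2, n_actions=2, r_max=1.0, m=1)
    s = 0
    for _ in range(steps):
        a = agent.act(s)
        r, s_next = toy_step(s, a)
        agent.update(s, a, r, s_next)
        s = s_next
    return agent
```

In this toy setting, optimism drives the agent to try each unknown state-action pair until it becomes known, after which the greedy policy moves to state 1 and stays there. The table sizes grow with the number of discretised states and actions, which gives some intuition for why the sample complexity depends on the size of the state-action space.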