Offline reinforcement learning (RL) leverages previously collected data to extract policies that achieve satisfactory performance in online environments. However, offline RL suffers from distribution shift between the offline dataset and the online environment. In the multi-agent RL (MARL) setting, this distribution shift may arise from nonstationary opponents (exogenous agents beyond the learner's control) whose behavior during online testing differs from that recorded in the offline dataset. Hence, the key to broader deployment of offline MARL is online adaptation to nonstationary opponents. Recent advances in large language models have demonstrated the surprising generalization ability of the transformer architecture in sequence modeling, which raises the question of \textit{whether an offline-trained transformer policy can adapt to nonstationary opponents during online testing}. This work proposes the self-confirming loss (SCL) for offline transformer training to address online nonstationarity, motivated by the self-confirming equilibrium (SCE) in game theory. The gist is that the transformer learns to predict the opponents' future moves and then acts on those predictions. As a weaker variant of Nash equilibrium (NE), SCE (equivalently, SCL) requires only local consistency: the agent's local observations do not deviate from its conjectures. This leads to a more adaptable policy than one dictated by NE, which focuses on global optimality. We evaluate the online adaptability of the self-confirming transformer (SCT) by playing it against nonstationary opponents that employ a variety of policies, ranging from a random policy to benchmark MARL policies. Experimental results demonstrate that SCT can adapt to nonstationary opponents online, achieving higher returns than vanilla transformers and offline MARL baselines.