Offline-to-Online Reinforcement Learning has emerged as a powerful paradigm that leverages offline data for initialization and online fine-tuning to improve both sample efficiency and performance. However, most existing research has focused on single-agent settings, with limited exploration of the multi-agent extension, i.e., Offline-to-Online Multi-Agent Reinforcement Learning (O2O MARL). In O2O MARL, two critical challenges become more prominent as the number of agents increases: (i) the risk of unlearning pre-trained Q-values due to distributional shift during the transition from the offline to the online phase, and (ii) the difficulty of exploring the large joint state-action space efficiently. To tackle these challenges, we propose a novel O2O MARL framework called Offline Value Function Memory with Sequential Exploration (OVMSE). First, we introduce the Offline Value Function Memory (OVM) mechanism to compute target Q-values, preserving knowledge gained during offline training, ensuring a smoother transition, and enabling efficient fine-tuning. Second, we propose a decentralized Sequential Exploration (SE) strategy tailored for O2O MARL, which effectively uses the pre-trained offline policy for exploration and thereby significantly reduces the joint state-action space to be explored. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate that OVMSE significantly outperforms existing baselines, achieving superior sample efficiency and overall performance.
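To make the two components more concrete, the sketch below illustrates one plausible reading of the abstract in PyTorch-style pseudocode. It is not the paper's exact formulation: the specific choices of keeping a frozen copy of the offline-trained Q-network and taking an elementwise maximum when forming TD targets (for OVM), and of letting agents take turns exploring while the others follow the pre-trained policy (for SE), are assumptions introduced here purely for illustration; the function names `ovm_td_target` and `select_joint_action` are likewise hypothetical.

```python
# Illustrative sketch only; the actual OVM and SE mechanisms are defined in the paper body.
import random
import torch


def ovm_td_target(q_online_target, q_offline_frozen, reward, next_obs, done, gamma=0.99):
    """One way to preserve offline knowledge in TD targets (assumption, not the paper's spec):
    bootstrap from whichever of the online target network or the frozen offline Q-network
    assigns the higher value, so fine-tuning does not regress below the offline estimates."""
    with torch.no_grad():
        q_next_online = q_online_target(next_obs).max(dim=-1).values    # online bootstrap
        q_next_offline = q_offline_frozen(next_obs).max(dim=-1).values  # frozen offline "memory"
        q_next = torch.maximum(q_next_online, q_next_offline)           # keep the better estimate
        return reward + gamma * (1.0 - done) * q_next


def select_joint_action(agent_policies, observations, exploring_agent, epsilon, action_sizes):
    """A toy sequential-exploration schedule (assumption): only the designated agent may
    deviate from the pre-trained policy at this step, shrinking the explored joint space."""
    actions = []
    for i, (policy, obs) in enumerate(zip(agent_policies, observations)):
        if i == exploring_agent and random.random() < epsilon:
            actions.append(random.randrange(action_sizes[i]))  # exploratory action
        else:
            actions.append(policy(obs))  # act with the pre-trained (or fine-tuned) policy
    return actions
```

Under these assumptions, `q_offline_frozen` would be a detached copy of the Q-network made once at the end of offline training, and the `exploring_agent` index would be rotated across agents over the course of online fine-tuning.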