Knowledge distillation (KD) has the potential to accelerate MARL by employing a centralized teacher for decentralized students, but it faces key bottlenecks: (1) the challenge of synthesizing high-performing teaching policies in complex domains, (2) the difficulty teachers face when reasoning about out-of-distribution (OOD) states, and (3) the mismatch between the decentralized students' and the centralized teacher's observation spaces. To address these limitations, we propose HINT (Hierarchical INteractive Teacher-based transfer), a novel KD framework for MARL in the centralized training, decentralized execution setting. By leveraging hierarchical RL, HINT provides a scalable, high-performing teacher. Our key innovation, pseudo off-policy RL, enables the teacher policy to be updated using both teacher and student experience, thereby improving OOD adaptation. HINT also applies performance-based filtering to retain only outcome-relevant guidance, reducing the impact of observation mismatches. We evaluate HINT on challenging cooperative domains (e.g., FireCommander for resource allocation and MARINE for tactical combat). Across these benchmarks, HINT outperforms baselines, achieving success-rate improvements of 60% to 165%.