In standard reinforcement learning, an episode is defined as a sequence of interactions between agents and the environment, terminating upon reaching a terminal state or a pre-defined maximum episode length. Setting a shorter episode length allows more episodes to be generated from the same number of data samples, thereby facilitating exploration of diverse states. While shorter episodes limit the collection of long-term interactions, they can offer significant advantages when properly managed. For example, trajectory truncation in single-agent reinforcement learning has shown that the benefits of shorter episodes can outweigh the loss of long-term interaction experience. However, this approach remains underexplored in multi-agent reinforcement learning (MARL). This paper proposes a novel MARL approach, Adaptive Episode Length Adjustment (AELA), in which the episode length is initially limited and then gradually increased based on an entropy-based assessment of learning progress. By starting with shorter episodes, agents can focus on learning effective strategies for initial states and minimize time spent in dead-end states. The use of entropy as an assessment metric prevents premature convergence to suboptimal policies and ensures balanced training over varying episode lengths. We validate our approach using the StarCraft Multi-Agent Challenge (SMAC) and a modified predator-prey environment, demonstrating significant improvements in both convergence speed and overall performance compared to existing methods. To the best of our knowledge, this is the first study to adaptively adjust episode length in MARL based on learning progress.
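As a rough illustration of the idea summarized above, the following Python sketch grows a cap on episode length once the agents' mean policy entropy drops below a threshold, i.e., once learning under the current cap has stabilized. This is not the paper's exact AELA algorithm; all names and parameters (initial_limit, max_limit, entropy_threshold, growth_step) are hypothetical placeholders for the abstract's description.

```python
import numpy as np

class EpisodeLengthScheduler:
    """Illustrative sketch of an entropy-gated episode-length schedule.

    Hypothetical stand-in for the mechanism described in the abstract:
    start with short episodes and lengthen them as policy entropy falls.
    """

    def __init__(self, initial_limit=10, max_limit=200,
                 entropy_threshold=0.1, growth_step=10):
        self.limit = initial_limit            # current episode-length cap
        self.max_limit = max_limit            # environment's full horizon
        self.entropy_threshold = entropy_threshold
        self.growth_step = growth_step

    def policy_entropy(self, action_probs):
        # Mean entropy of the agents' action distributions (natural log).
        p = np.clip(np.asarray(action_probs), 1e-8, 1.0)
        return float(np.mean(-np.sum(p * np.log(p), axis=-1)))

    def update(self, action_probs):
        # If entropy under the current cap has dropped below the threshold,
        # the policies have (tentatively) converged for these shorter
        # episodes, so extend the cap toward the full horizon.
        if self.policy_entropy(action_probs) < self.entropy_threshold:
            self.limit = min(self.limit + self.growth_step, self.max_limit)
        return self.limit
```

In use, the scheduler would be queried between training iterations with the action probabilities collected from the latest batch of episodes, and the returned cap would be passed to the environment as its truncation length.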