We introduce the Laser Learning Environment (LLE), a collaborative multi-agent reinforcement learning (MARL) environment in which coordination is central. In LLE, agents depend on each other to make progress (interdependence), must jointly take specific sequences of actions to succeed (perfect coordination), and receive no intermediate reward for accomplishing those joint actions (zero-incentive dynamics). The challenge of such problems lies in escaping the state space bottlenecks caused by interdependence steps, since escaping those bottlenecks is not rewarded. We test multiple state-of-the-art value-based MARL algorithms against LLE and show that they consistently fail at the collaborative task because of their inability to escape state space bottlenecks, even though they successfully achieve perfect coordination. We show that Q-learning extensions such as prioritized experience replay and n-step returns hinder exploration in environments with zero-incentive dynamics, and find that intrinsic curiosity with random network distillation is not sufficient to escape those bottlenecks. We demonstrate the need for novel methods to solve this problem and the relevance of LLE as a cooperative MARL benchmark.