Deep reinforcement learning (DRL) has seen remarkable success in the control of single robots. However, applying DRL to robot swarms presents significant challenges. A critical challenge is non-stationarity, which occurs when two or more robots update individual or shared policies concurrently, thereby engaging in an interdependent training process with no guarantee of convergence. Circumventing non-stationarity typically involves training the robots with global information about other agents' states and/or actions. In contrast, in this paper we explore how to remove the need for global information. Because the robots lack global knowledge of other agents, we pose our problem as a Partially Observable Markov Decision Process. Using collective transport as a testbed scenario, we study two approaches to multi-agent training. In the first, the robots exchange no messages and are trained to rely on implicit communication through push-and-pull on the object being transported. In the second approach, we introduce Global State Prediction (GSP), a network trained to form a belief over the swarm as a whole and predict its future states. We provide a comprehensive study over four well-known deep reinforcement learning algorithms in environments with obstacles, measuring performance as the successful transport of the object to the goal within a desired time frame. Through an ablation study, we show that including GSP boosts performance and increases robustness when compared with methods that use global knowledge.