Multi-Agent Deep Reinforcement Learning For Persistent Monitoring With Sensing, Communication, and Localization Constraints

Determining multi-robot motion policies for persistently monitoring a region under sensing, communication, and localization constraints in non-GPS environments is a challenging problem. To take the localization constraints into account, in this paper we consider a heterogeneous robotic system consisting of two types of agents: anchor agents with accurate localization capability and auxiliary agents with low localization accuracy. To localize itself, an auxiliary agent must be within the communication range of an anchor, either directly or indirectly through other agents. The robotic team's objective is to minimize environmental uncertainty through persistent monitoring. We propose a multi-agent deep reinforcement learning (MARL) based architecture with graph convolution called Graph Localized Proximal Policy Optimization (GALOPP), which incorporates the limited sensor field-of-view, communication, and localization constraints of the agents along with persistent monitoring objectives to determine motion policies for each agent. We evaluate the performance of GALOPP on open maps with obstacles, using different numbers of anchor and auxiliary agents. We further (i) study the effect of communication range, obstacle density, and sensing range on the performance and (ii) compare the performance of GALOPP with non-RL baselines, namely greedy search, random search, and random search with communication constraint. To assess its generalization capability, we also evaluate GALOPP in two different environments: 2-room and 4-room. The results show that GALOPP learns the policies and monitors the area well. As a proof of concept, we perform hardware experiments to demonstrate the performance of GALOPP.
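The localization constraint described above (an auxiliary agent is localized only if it can reach an anchor directly or through a chain of agents within communication range) can be read as a reachability check on the communication graph. The following is a minimal sketch of that reading, not the GALOPP implementation: the function name `localized_agents`, the 2D positions, the single shared `comm_range`, and the disc communication model that ignores obstacles are all assumptions made for illustration.

```python
from collections import deque
from math import hypot


def localized_agents(positions, anchors, comm_range):
    """Return the ids of agents that can localize (hypothetical helper).

    positions:  dict mapping agent id -> (x, y)
    anchors:    set of agent ids with accurate localization
    comm_range: assumed maximum distance at which two agents communicate
    """
    # Anchors are always localized; they seed the breadth-first search.
    localized = set(anchors)
    queue = deque(anchors)
    while queue:
        current = queue.popleft()
        cx, cy = positions[current]
        for other, (ox, oy) in positions.items():
            if other in localized:
                continue
            # Assumption: communication depends on distance only (no obstacles).
            if hypot(cx - ox, cy - oy) <= comm_range:
                localized.add(other)
                queue.append(other)
    return localized


# Illustrative layout: one anchor and three auxiliary agents on a line.
positions = {
    "anchor0": (0.0, 0.0),
    "aux1": (1.0, 0.0),
    "aux2": (2.0, 0.0),
    "aux3": (5.0, 0.0),
}
result = localized_agents(positions, {"anchor0"}, comm_range=1.5)
```

In this layout `aux1` reaches the anchor directly, `aux2` reaches it indirectly through `aux1`, and `aux3` is out of range of every localized agent, so it is not in `result`.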
