Multi-agent reinforcement learning (MARL) has shown wide applicability in collaborative systems such as autonomous driving and smart cities, owing to its ability to learn through interaction. With the recent development of drone networks, researchers have also applied MARL to trajectory planning problems. However, dynamic environments and limited battery capacity still make it challenging to use MARL for efficient collaborative task execution. In this paper, we propose an energy-aware MARL model to tackle these challenges, leveraging Deep Q-Networks (DQN) with \emph{individual reward functions} driven by the task execution progress and the remaining battery of the drones. We conduct a set of simulation studies of the proposed model and compare it with the shared-reward MARL~\cite{Li2022MARL} to explore the impact of credit assignment in MARL. The results indicate that our proposed model achieves a success rate of at least 80\% regardless of task locations and lengths. Like the shared-reward model, the individual-reward model achieves a higher success rate when the task density is high, reaching nearly 100\% as the task density approaches 40\%. The advantage of the individual-reward model becomes clear when the environment is scaled up: compared with the shared-reward MARL, it is more robust to changes in environment size and number of agents. It also achieves a higher success rate in fewer steps owing to the clarity of each agent's goal, which further improves energy efficiency.
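As a purely illustrative sketch, and not the formulation used in this paper, an individual reward driven by task progress and remaining battery could take a form such as the one below. All symbols are assumptions introduced here for illustration: $\Delta p_i^t$ is the task progress made by drone $i$ at step $t$, $b_i^t$ is its remaining battery, $B$ is the battery capacity, $N$ is the number of drones, and $\alpha, \beta \ge 0$ are weights. The shared-reward variant shown alongside, in which every agent receives the same team-level average, is likewise only one possible reading of a shared signal.

\[
r_i^t = \alpha\,\Delta p_i^t + \beta\,\frac{b_i^t}{B},
\qquad
r_{\mathrm{shared}}^t = \frac{1}{N}\sum_{j=1}^{N} r_j^t .
\]

Under these assumptions, each agent's reward in the individual case depends only on its own progress and battery, whereas in the shared case an agent's signal also reflects the behavior of the other $N-1$ agents, which is the credit assignment issue the comparison is meant to explore.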