Maintaining the freshness of information in the Internet of Things (IoT) is a critical yet challenging problem. In this paper, we study cooperative data collection using multiple Unmanned Aerial Vehicles (UAVs) with the objective of minimizing the total average Age of Information (AoI). We account for the kinematic, energy, trajectory, and collision-avoidance constraints of the UAVs when optimizing the data collection process. Specifically, each UAV, which has limited on-board energy, takes off from its initial location and flies over sensor nodes to collect update packets in cooperation with the other UAVs. The UAVs must land at their final destinations with non-negative residual energy after the specified time duration, so that they have enough energy to complete their missions. Information freshness therefore depends on jointly designing the trajectories of the UAVs and the transmission scheduling of the sensor nodes. We model the multi-UAV data collection problem as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), since each UAV is unaware of the dynamics of the environment and can observe only a subset of the sensors. To solve this problem, we propose a multi-agent Deep Reinforcement Learning (DRL)-based algorithm with centralized learning and decentralized execution. In addition to reward shaping, we use action masks to filter out invalid actions and ensure that the constraints are met. Simulation results demonstrate that the proposed algorithm significantly reduces the total average AoI compared to the baseline algorithms, and that the action mask method improves its convergence speed.
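As an illustration of the action-mask idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes a toy grid world with five moves, uses the number of remaining time steps as a stand-in for the time and energy budget, and assumes a common AoI model in which a sensor's age grows by one per slot and resets to one when its packet is collected. All names (`MOVES`, `valid_action_mask`, `masked_softmax`, `step_aoi`) are hypothetical.

```python
import math

# Hypothetical discrete action set for one UAV on a grid (an assumption, not from the paper).
MOVES = {"hover": (0, 0), "north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}


def valid_action_mask(pos, dest, steps_left, grid_size, occupied):
    """Return one boolean per move: True if the move keeps every constraint satisfiable."""
    mask = []
    for dx, dy in MOVES.values():
        nx, ny = pos[0] + dx, pos[1] + dy
        in_bounds = 0 <= nx < grid_size and 0 <= ny < grid_size
        no_collision = (nx, ny) not in occupied  # cells held by other UAVs
        # After this move, steps_left - 1 steps remain to reach the destination.
        can_reach_dest = abs(dest[0] - nx) + abs(dest[1] - ny) <= steps_left - 1
        mask.append(in_bounds and no_collision and can_reach_dest)
    return mask


def masked_softmax(logits, mask):
    """Turn policy logits into probabilities, giving invalid actions probability zero."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Invalid actions get a logit of -inf, so exp() maps them to exactly 0.
    masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


def step_aoi(ages, collected):
    """Assumed AoI update: reset to 1 for collected sensors, otherwise age by one slot."""
    return [1 if i in collected else age + 1 for i, age in enumerate(ages)]


if __name__ == "__main__":
    mask = valid_action_mask(pos=(0, 0), dest=(0, 0), steps_left=3,
                             grid_size=4, occupied={(1, 0)})
    print(mask)
    print(masked_softmax([0.0] * len(MOVES), mask))
    print(step_aoi([3, 5, 2], collected={1}))
```

Because masked actions receive zero probability, the policy never samples them, so the agent does not have to learn the hard constraints from penalties alone. That is one plausible reason masking can speed up convergence.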