Collaborative Reinforcement Learning Based Unmanned Aerial Vehicle (UAV) Trajectory Design for 3D UAV Tracking

In this paper, the problem of using one active unmanned aerial vehicle (UAV) and four passive UAVs to localize a target UAV in 3D space in real time is investigated. In the considered model, the active UAV transmits signals that are reflected by the target UAV and received by each passive UAV. From the received reflection signals, each passive UAV estimates the signal transmission distance and reports it to a base station (BS), which uses these distances to estimate the position of the target UAV. Because the target UAV moves, each active and passive UAV must optimize its trajectory to localize the target UAV continuously. Meanwhile, since the accuracy of distance estimation depends on the signal-to-noise ratio (SNR) of the received signals, the active UAV must also optimize its transmit power. This problem is formulated as an optimization problem that jointly optimizes the transmit power of the active UAV and the trajectories of both the active and passive UAVs so as to maximize the positioning accuracy of the target UAV. To solve this problem, a Z function decomposition based reinforcement learning (ZD-RL) method is proposed. Unlike value function decomposition based RL (VD-RL), which estimates only the expected value of the sum of future rewards, the proposed method learns the probability distribution of the sum of future rewards and can therefore estimate its expected value more accurately. As a result, ZD-RL finds a better transmit power for the active UAV and better trajectories for both the active and passive UAVs, improving target UAV positioning accuracy. Simulation results show that the proposed ZD-RL method can reduce positioning errors by up to 39.4% and 64.6% compared to VD-RL and independent deep RL methods, respectively.
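The BS step above (turning the distances reported by the passive UAVs into a 3D target position) can be illustrated with a short nonlinear least-squares sketch. It assumes, as an illustration only, that the "signal transmission distance" measured by passive UAV i is the bistatic path length d_i = ||x - a|| + ||x - p_i||, where x is the target position, a the active UAV position, and p_i the position of passive UAV i, and that the BS solves for x with Gauss-Newton iterations. The function names `bistatic_ranges` and `locate_target` and the example geometry are hypothetical and are not taken from the paper.

```python
import numpy as np

def bistatic_ranges(target, active, passives):
    """Hypothetical measurement model: active -> target -> passive path length per passive UAV."""
    return np.linalg.norm(target - active) + np.linalg.norm(passives - target, axis=1)

def locate_target(d, active, passives, x0, iters=50, tol=1e-9):
    """Gauss-Newton least squares for the target position x (a sketch, not the paper's estimator).

    Residual for passive UAV i: r_i(x) = ||x - a|| + ||x - p_i|| - d_i.
    """
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        to_active = x - active              # vector from active UAV to current estimate, shape (3,)
        to_passive = x - passives           # vectors from each passive UAV to estimate, shape (M, 3)
        ra = np.linalg.norm(to_active)
        rp = np.linalg.norm(to_passive, axis=1)
        r = ra + rp - d                     # residuals, shape (M,)
        # Jacobian row i: unit vector a->x plus unit vector p_i->x
        J = to_active / ra + to_passive / rp[:, None]
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x += step
        if np.linalg.norm(step) < tol:
            break
    return x

# Example with made-up positions (metres): one active UAV, four passive UAVs.
active = np.array([0.0, 0.0, 50.0])
passives = np.array([[100.0, 0, 40], [0, 100.0, 40], [-100.0, 0, 60], [0, -100.0, 60]])
true_x = np.array([20.0, 30.0, 80.0])
d = bistatic_ranges(true_x, active, passives)   # noise-free distances
est = locate_target(d, active, passives, x0=true_x + np.array([5.0, -5.0, 5.0]))
```

With noise-free distances the estimate recovers the true position. In a fuller model, zero-mean noise would be added to `d` with a spread that shrinks as SNR grows; that is one way to see how the active UAV's transmit power, and the UAV geometry set by the trajectories, influence positioning error.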
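The contrast between ZD-RL and VD-RL rests on learning a distribution of the sum of future rewards rather than only its mean. The sketch below illustrates that idea with a quantile representation in the style of QR-DQN. It assumes each agent outputs N quantiles of its own return distribution and that the joint distribution is modeled additively by summing sorted per-agent quantiles. This sum is exact only when the per-agent returns are comonotonic, so here it is a modeling assumption. All function names are hypothetical, and this is not claimed to be the paper's ZD-RL architecture. The mean of the summed quantiles equals the sum of per-agent means, which is the additive joint value used by VD-RL. The extra information lies in the distributional TD loss.

```python
import numpy as np

def joint_quantiles(agent_quantiles):
    """Additive joint return distribution: sum of each agent's sorted quantiles (assumption)."""
    return np.sum(np.sort(agent_quantiles, axis=1), axis=0)

def expected_return(quantiles):
    """Expected sum of future rewards, read off as the mean of equally weighted quantiles."""
    return float(np.mean(quantiles))

def td_target_quantiles(reward, next_joint_quantiles, gamma=0.99, done=False):
    """Distributional Bellman target applied quantile by quantile."""
    return reward + (0.0 if done else gamma) * next_joint_quantiles

def quantile_huber_loss(pred, target, kappa=1.0):
    """QR-DQN quantile Huber loss between predicted and target quantile vectors."""
    n = pred.shape[0]
    taus = (np.arange(n) + 0.5) / n                 # quantile midpoints tau_i
    u = target[None, :] - pred[:, None]             # u[i, j] = target_j - pred_i
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa, 0.5 * u**2, kappa * (abs_u - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (u < 0).astype(float))
    # Sum over predicted quantiles i, average over target samples j.
    return float(np.mean(np.sum(weight * huber / kappa, axis=0)))

# Hypothetical per-agent quantile outputs for two agents, N = 4 quantiles each.
agents = np.array([[0.2, 0.5, 0.9, 1.4], [0.1, 0.3, 0.8, 1.0]])
z_now = joint_quantiles(agents)
z_next = joint_quantiles(agents + 0.1)
target = td_target_quantiles(1.0, z_next, gamma=0.9)
loss = quantile_huber_loss(z_now, target)
```

In a trained system, the per-agent quantiles would come from neural networks conditioned on each UAV's observation and action, and `loss` would be minimized by gradient descent. NumPy is used here only to make the arithmetic explicit.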

翻译：本文研究了使用一架主动无人机与四架被动无人机组网实时定位三维目标无人机的问题。在所考虑的模型中，每架被动无人机接收来自目标无人机的反射信号（该信号最初由主动无人机发射）。被动无人机通过接收到的反射信号估计信号传输距离，并将该距离信息传输至基站用于目标无人机位置的估计。由于目标无人机的运动，每架主动/被动无人机必须优化其飞行轨迹以实现对目标无人机的持续定位。同时，由于距离估计精度取决于传输信号的信噪比，主动无人机需优化其发射功率。该问题被建模为一个优化问题，其目标是通过联合优化主动无人机的发射功率及所有主动/被动无人机的飞行轨迹，最大化目标无人机的定位精度。为解决该问题，提出了一种基于Z函数分解的强化学习（ZD-RL）方法。与基于值函数分解的强化学习（VD-RL）相比，所提方法能够求解未来奖励总和的概率分布，从而更精确地估计未来奖励总和的期望值，进而优化主动无人机的发射功率及所有无人机的飞行轨迹，提升目标定位精度。仿真结果表明，所提ZD-RL方法相较于VD-RL和独立深度强化学习方法，分别可将定位误差降低高达39.4%和64.6%。