In this paper, we study the Multi-Start Team Orienteering Problem (MSTOP), a mission re-planning problem in which vehicles are initially located away from the depot and have different amounts of fuel. We assume that the goal of the vehicles is to maximize the sum of collected profits subject to resource (e.g., time, fuel) consumption constraints. Such re-planning problems occur in a wide range of intelligent UAS applications where changes in the mission environment force the operation of multiple vehicles to deviate from the original plan. To solve this problem with deep reinforcement learning (RL), we develop a policy network with self-attention on each partial tour and encoder-decoder attention between the partial tour and the remaining nodes. We propose a modified REINFORCE algorithm in which the greedy rollout baseline is replaced by a local mini-batch baseline based on multiple, possibly non-duplicate sample rollouts. By drawing multiple samples per training instance, we can learn faster and obtain a stable policy gradient estimator with significantly fewer instances. The proposed training algorithm outperforms the conventional greedy rollout baseline, even when combined with the maximum entropy objective.
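The local mini-batch baseline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this sketch assumes the baseline for each training instance is the mean reward of the sample rollouts drawn for that same instance, and it omits any filtering of duplicate rollouts. The function names `local_baseline_advantages` and `reinforce_surrogate_loss`, the array shapes, and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np


def local_baseline_advantages(rewards):
    """Subtract a per-instance baseline from each sampled reward.

    rewards: array of shape (num_instances, num_samples), where row i holds
    the collected profit of each sample rollout drawn for instance i.
    Assumption: the baseline is the mean reward over the samples of the
    same instance (duplicate rollouts are not filtered in this sketch).
    """
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean(axis=1, keepdims=True)
    return rewards - baseline


def reinforce_surrogate_loss(log_probs, rewards):
    """REINFORCE surrogate loss, -mean(advantage * log-probability).

    log_probs: array of shape (num_instances, num_samples) with the
    log-probability of each sampled rollout under the current policy.
    Minimizing this loss increases the likelihood of rollouts that score
    above their instance's local baseline.
    """
    advantages = local_baseline_advantages(rewards)
    return float(-(advantages * np.asarray(log_probs, dtype=float)).mean())


# Toy data: 2 instances, 3 sample rollouts each (hypothetical values).
rewards = np.array([[3.0, 5.0, 4.0], [10.0, 10.0, 13.0]])
log_probs = np.array([[-1.0, -2.0, -1.5], [-0.5, -0.5, -3.0]])

advantages = local_baseline_advantages(rewards)
loss = reinforce_surrogate_loss(log_probs, rewards)
```

Because the baseline is computed from the same instance's samples, the advantages in each row sum to zero and no separate greedy rollout of a baseline policy is needed. In an actual training loop the log-probabilities would come from the policy network and the loss would be differentiated with an automatic differentiation framework rather than evaluated with NumPy.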