Maritime surveillance missions, such as search and rescue and environmental monitoring, rely on the efficient allocation of sensing assets over vast and geometrically complex areas. Traditional Coverage Path Planning (CPP) approaches depend on decomposition techniques that struggle with irregular coastlines, islands, and exclusion zones, or they require computationally expensive re-planning for every instance. We propose a Deep Reinforcement Learning (DRL) framework that solves CPP on hexagonal grid representations of irregular maritime areas. Unlike conventional methods, we formulate the problem as a neural combinatorial optimization task in which a Transformer-based pointer policy autoregressively constructs coverage tours. To overcome the instability of value estimation in long-horizon routing problems, we implement a critic-free Group-Relative Policy Optimization (GRPO) scheme, which estimates advantages through within-instance comparisons of sampled trajectories rather than through a learned value function. Experiments on 1,000 unseen synthetic maritime environments show that the trained policy achieves a 99.0% Hamiltonian success rate, more than double that of the best heuristic (46.0%), while producing paths that are 7% shorter and have 24% fewer heading changes than those of the closest baseline. All three inference modes (greedy, stochastic sampling, and sampling with 2-opt refinement) run in under 50 ms per instance on a laptop GPU, confirming feasibility for real-time on-board deployment.
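The critic-free advantage estimate described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used GRPO form, in which the rewards of several trajectories sampled for the same instance are standardized by the group mean and standard deviation. The abstract does not state the exact normalization or reward definition used in this work, so the function name `group_relative_advantages`, the `eps` constant, and the example rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize the rewards of trajectories sampled for ONE instance.

    Assumption: mean/std normalization, as in the common GRPO formulation.
    No value function (critic) is involved; the group mean is the baseline.
    """
    if not rewards:
        raise ValueError("need at least one sampled trajectory")
    baseline = mean(rewards)   # within-instance baseline
    scale = pstdev(rewards)    # population std of the group
    return [(r - baseline) / (scale + eps) for r in rewards]


# Hypothetical rewards for four tours sampled on the same map,
# e.g. the negative tour length (higher is better).
rewards = [-42.0, -40.0, -45.0, -41.0]
advantages = group_relative_advantages(rewards)
```

Trajectories that beat the group average receive a positive advantage and the others a negative one, so the policy gradient only needs relative comparisons within an instance.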