Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture that splits the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2 Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar images, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10 Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning with Prior Data (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy uses Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths within 4% to 6% of an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when it encounters unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.
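To make the two-rate hierarchy concrete, the following is a minimal Python sketch of how a 2 Hz high-level policy and a 10 Hz low-level policy could be interleaved, so that each subgoal is held for five low-level steps. It is an illustration under stated assumptions, not the paper's implementation: `run_hierarchy`, the toy policies, and the three-element position observation are all hypothetical stand-ins for the learned RLPD and SAC+HER policies and for the camera, sonar, and proprioceptive inputs.

```python
from typing import Callable, List, Optional, Tuple

HL_RATE_HZ = 2    # high-level policy rate stated in the abstract
LL_RATE_HZ = 10   # low-level policy rate stated in the abstract
LL_STEPS_PER_SUBGOAL = LL_RATE_HZ // HL_RATE_HZ  # 5 low-level ticks per subgoal

Vector = Tuple[float, ...]


def run_hierarchy(
    hl_policy: Callable[[Vector], Vector],
    ll_policy: Callable[[Vector, Vector], Vector],
    observe: Callable[[int], Vector],
    n_ll_steps: int,
) -> List[Tuple[Vector, Vector]]:
    """Run the loop at the low-level rate; return (subgoal, command) per tick."""
    log: List[Tuple[Vector, Vector]] = []
    subgoal: Optional[Vector] = None
    for tick in range(n_ll_steps):
        obs = observe(tick)
        # The high-level policy only acts on every fifth low-level tick.
        if tick % LL_STEPS_PER_SUBGOAL == 0:
            subgoal = hl_policy(obs)
        # The low-level policy acts on every tick, conditioned on the held subgoal.
        command = ll_policy(obs, subgoal)
        log.append((subgoal, command))
    return log


# Hypothetical toy stand-ins (assumptions, not the learned policies).
def toy_observe(tick: int) -> Vector:
    # Pretend the vehicle advances one unit along x per tick.
    return (float(tick), 0.0, 0.0)


def toy_hl_policy(obs: Vector) -> Vector:
    # Place the subgoal one unit ahead of the current position.
    return (obs[0] + 1.0, obs[1], obs[2])


def toy_ll_policy(obs: Vector, subgoal: Vector) -> Vector:
    # Proportional "thruster command": the error between subgoal and position.
    return tuple(g - o for g, o in zip(subgoal, obs))


log = run_hierarchy(toy_hl_policy, toy_ll_policy, toy_observe, n_ll_steps=10)
distinct_subgoals = {subgoal for subgoal, _ in log}
```

Over one simulated second (ten low-level ticks), the high-level policy is queried twice, so `distinct_subgoals` contains two entries while `log` contains ten commands. Holding the subgoal fixed between high-level updates is one simple design choice; the sketch does not show how the actual system synchronizes the two policies.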