Autonomous Ground Vehicles (AGVs) are essential tools for a wide range of applications because they can operate in hazardous environments with minimal human operator input. Effective motion planning is paramount for the successful operation of AGVs. Conventional motion planning algorithms depend on prior knowledge of environment characteristics and offer limited utility in information-poor, dynamically changing environments, such as areas struck by emergency hazards like fires and earthquakes, and unexplored subterranean environments such as tunnels and lava tubes on Mars. We propose a Deep Reinforcement Learning (DRL) framework for intelligent AGV exploration without a priori maps, using Actor-Critic DRL algorithms to learn policies in continuous, high-dimensional action spaces directly from raw sensor data. The DRL architecture comprises feedforward neural networks for the actor and critic representations: the actor network selects linear and angular velocity control actions given the current state inputs, and the critic network evaluates those actions by learning to estimate Q-values, with the goal of maximizing the accumulated reward. Three off-policy DRL algorithms, DDPG, TD3 and SAC, are trained and compared in two environments of varying complexity, and further evaluated in a third with no prior training or knowledge of map characteristics. The agent is shown to learn optimal policies by the end of each training period, charting quick, collision-free exploration trajectories, and the framework is extensible, adapting to an unknown environment without changes to network architecture or hyperparameters. The best-performing algorithm is further evaluated in a realistic 3D environment.
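To make the actor-critic split concrete, the following is a minimal, untrained sketch in plain Python of the two feedforward networks the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the state dimension, hidden layer size, activations, or velocity limits, so `state_dim=8`, `hidden=16`, `max_lin=0.5`, `max_ang=1.0`, the ReLU/tanh choices, and the forward-only linear velocity are all hypothetical. The `td_target` helper shows the generic one-step bootstrapped target that critics of this family regress toward; DDPG, TD3 and SAC each refine it (target networks, the minimum of two critics, an entropy term), and none of those refinements is shown here.

```python
import math
import random

def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer with small random weights."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(layer, x, activation=None):
    """Apply one dense layer to the input vector x."""
    weights, biases = layer
    out = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [activation(v) for v in out] if activation else out

def relu(v):
    return max(0.0, v)

class Actor:
    """Maps a state vector to (linear velocity, angular velocity)."""
    def __init__(self, state_dim, hidden, max_lin, max_ang, rng):
        self.l1 = init_layer(state_dim, hidden, rng)
        self.l2 = init_layer(hidden, 2, rng)
        self.max_lin, self.max_ang = max_lin, max_ang

    def act(self, state):
        h = dense(self.l1, state, relu)
        raw_lin, raw_ang = dense(self.l2, h, math.tanh)  # each in [-1, 1]
        # Assumption: forward-only linear velocity, rescaled to [0, max_lin].
        lin = 0.5 * (raw_lin + 1.0) * self.max_lin
        ang = raw_ang * self.max_ang
        return lin, ang

class Critic:
    """Estimates a scalar Q(state, action)."""
    def __init__(self, state_dim, hidden, rng):
        self.l1 = init_layer(state_dim + 2, hidden, rng)
        self.l2 = init_layer(hidden, 1, rng)

    def q_value(self, state, action):
        h = dense(self.l1, list(state) + list(action), relu)
        return dense(self.l2, h)[0]

def td_target(reward, done, next_q, gamma=0.99):
    """One-step target: r + gamma * Q(s', a'), with no bootstrap at terminal states."""
    return reward + (0.0 if done else gamma * next_q)

# Hypothetical 8-dimensional state, e.g. normalized range readings.
rng = random.Random(0)
actor = Actor(state_dim=8, hidden=16, max_lin=0.5, max_ang=1.0, rng=rng)
critic = Critic(state_dim=8, hidden=16, rng=rng)
state = [rng.uniform(0.0, 1.0) for _ in range(8)]
action = actor.act(state)          # (linear velocity, angular velocity)
q = critic.q_value(state, action)  # critic's estimate for that pair
```

The tanh output layer keeps both commands bounded, which is one common way to respect actuator limits in continuous control; a real agent would add a replay buffer and gradient-based updates of both networks.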