Reinforcement learning methods, while effective for learning robotic navigation strategies, are known to be highly sample-inefficient. This inefficiency stems in part from failing to balance exploration and exploitation suitably during policy optimization, especially in the presence of non-stationarity. To strike this balance and improve sample efficiency, we propose Ada-NAV, an adaptive trajectory length scheme in which the trajectory length grows as the policy's randomness, measured by its Shannon or differential entropy, decreases. The scheme emphasizes exploration early in training through more frequent gradient updates on short trajectories and emphasizes exploitation later through longer trajectories. In gridworld, simulated robotic environments, and real-world robotic experiments, we demonstrate the merits of the approach over constant and randomly sampled trajectory lengths in terms of performance and sample efficiency. For a fixed sample budget, Ada-NAV yields an 18% increase in navigation success rate, a 20-38% decrease in navigation path length, and a 9.32% decrease in elevation cost compared to the policies obtained by the other methods. We also demonstrate that Ada-NAV can be transferred to and integrated into a Clearpath Husky robot without significant performance degradation.
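To make the entropy-to-length idea concrete, the following is a minimal sketch for a discrete action policy. It is not the Ada-NAV rule itself: the abstract states only that trajectory length grows as policy entropy decreases, so the linear mapping, the function names, and the `min_len` and `max_len` bounds are all illustrative assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_trajectory_length(probs, min_len=10, max_len=100):
    """Map policy entropy to a trajectory length.

    ASSUMPTION: a linear interpolation between min_len (uniform, maximally
    random policy) and max_len (deterministic policy). The abstract does not
    specify the mapping; this is one monotone choice among many.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform policy
    if max_entropy == 0.0:
        # A single-action policy has no randomness to measure.
        return max_len
    normalized = shannon_entropy(probs) / max_entropy
    normalized = min(max(normalized, 0.0), 1.0)  # guard against rounding error
    return round(max_len - normalized * (max_len - min_len))
```

Under this assumed mapping, a uniform policy receives the shortest trajectories (and hence the most frequent gradient updates), while a near-deterministic policy receives the longest ones. For continuous action spaces, the differential entropy of the policy distribution would take the place of `shannon_entropy`, with a correspondingly different normalization.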