Reinforcement learning methods, while effective for learning robotic navigation strategies, are known to be highly sample-inefficient. This inefficiency stems in part from a failure to balance exploration and exploitation suitably during policy optimization, especially in the presence of non-stationarity. To strike this balance and thereby improve sample efficiency, we propose Ada-NAV, an adaptive trajectory length scheme in which the trajectory length grows as the policy's randomness, measured by its Shannon or differential entropy, decreases. The scheme emphasizes exploration at the beginning of training, when shorter trajectories yield more frequent gradient updates, and emphasizes exploitation later on with longer trajectories. In gridworld, simulated robotic environments, and real-world robotic experiments, we demonstrate the merits of the approach over constant and randomly sampled trajectory lengths in terms of performance and sample efficiency. For a fixed sample budget, Ada-NAV yields an 18% increase in navigation success rate, a 20-38% decrease in navigation path length, and a 9.32% decrease in elevation cost compared to the policies obtained by the other methods. We also demonstrate that Ada-NAV can be transferred to and integrated into a Clearpath Husky robot without significant performance degradation.
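The core idea, a trajectory length that grows as policy entropy falls, can be illustrated with a minimal sketch. The abstract does not state the exact mapping from entropy to length, so the linear interpolation below is an assumption, as are the names `shannon_entropy` and `adaptive_trajectory_length` and the bounds `min_len` and `max_len`. It covers only the discrete-action (Shannon entropy) case and should not be read as the authors' implementation.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def adaptive_trajectory_length(probs, min_len=8, max_len=64):
    """Map policy entropy to a trajectory length (illustrative assumption).

    High entropy (near-uniform policy) gives a short trajectory, hence
    more frequent gradient updates; low entropy (near-deterministic
    policy) gives a long trajectory. The linear rule and the default
    bounds are hypothetical choices, not taken from the source.
    """
    if len(probs) < 2:
        raise ValueError("need at least two actions to normalize entropy")
    max_entropy = math.log(len(probs))  # entropy of the uniform policy
    normalized = shannon_entropy(probs) / max_entropy
    normalized = min(max(normalized, 0.0), 1.0)  # guard against rounding error
    return round(min_len + (max_len - min_len) * (1.0 - normalized))


if __name__ == "__main__":
    print(adaptive_trajectory_length([0.25, 0.25, 0.25, 0.25]))  # uniform policy
    print(adaptive_trajectory_length([0.7, 0.1, 0.1, 0.1]))      # partly peaked
    print(adaptive_trajectory_length([1.0, 0.0, 0.0, 0.0]))      # deterministic
```

With these assumed defaults, a uniform policy over four actions maps to the minimum length of 8 and a deterministic policy maps to the maximum of 64. In a training loop, the length would be recomputed from the current policy before each rollout, so updates become less frequent as the policy becomes more certain.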