Autonomous exploration has many important applications. However, classic information gain-based or frontier-based exploration relies only on the robot's current state to determine the immediate exploration goal. It therefore cannot predict the value of future states, which leads to inefficient exploration decisions. This paper presents a method to learn how "good" states are, as measured by the state value function, to provide guidance for robot exploration in challenging real-world environments. We formulate our work as an off-policy evaluation (OPE) problem for robot exploration (OPERE). It consists of offline Monte Carlo training on real-world data followed by online Temporal Difference (TD) adaptation to refine the trained value estimator. We also design an intrinsic reward function based on sensor information coverage so that the robot can gain more information when extrinsic rewards are sparse. Results show that our method enables the robot to predict the value of future states and thereby better guide its exploration. The proposed algorithm achieves better prediction and exploration performance than state-of-the-art methods. To the best of our knowledge, this work is the first to demonstrate value function prediction on a real-world dataset for robot exploration in challenging subterranean and urban environments. More details and demo videos can be found at https://jeffreyyh.github.io/opere/.
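To make the two-stage training scheme concrete, the following is a minimal sketch and not the paper's implementation. It assumes a linear value function over hand-supplied feature vectors, whereas the abstract does not specify the estimator's form. The names `LinearValue`, `mc_fit`, `td_update`, and `coverage_reward` are hypothetical, and the coverage reward (a count of newly observed map cells) is only one plausible reading of "sensor information coverage". The sketch also omits any off-policy correction such as importance weighting.

```python
from typing import List, Sequence, Set, Tuple

Cell = Tuple[int, int]


def coverage_reward(seen: Set[Cell], observed: Set[Cell]) -> Tuple[float, Set[Cell]]:
    """Hypothetical intrinsic reward: number of cells observed for the first time."""
    new_cells = observed - seen
    return float(len(new_cells)), seen | observed


def discounted_returns(rewards: Sequence[float], gamma: float) -> List[float]:
    """Monte Carlo return G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns


class LinearValue:
    """Assumed estimator: V(s) = w . x(s) for a feature vector x(s)."""

    def __init__(self, n_features: int) -> None:
        self.w = [0.0] * n_features

    def predict(self, x: Sequence[float]) -> float:
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def _step(self, x: Sequence[float], error: float, lr: float) -> None:
        # Gradient step on the squared error with respect to w.
        self.w = [wi + lr * error * xi for wi, xi in zip(self.w, x)]

    def mc_fit(self, episodes, gamma: float, lr: float, epochs: int) -> None:
        """Offline stage: regress V(s_t) onto Monte Carlo returns from logged episodes."""
        for _ in range(epochs):
            for features, rewards in episodes:
                targets = discounted_returns(rewards, gamma)
                for x, g in zip(features, targets):
                    self._step(x, g - self.predict(x), lr)

    def td_update(self, x, reward: float, x_next, gamma: float, lr: float,
                  terminal: bool = False) -> float:
        """Online stage: one TD(0) update toward r + gamma * V(s')."""
        bootstrap = 0.0 if terminal else gamma * self.predict(x_next)
        td_error = reward + bootstrap - self.predict(x)
        self._step(x, td_error, lr)
        return td_error


if __name__ == "__main__":
    # One logged episode over two one-hot states with rewards 1 and 2.
    logged = [([[1.0, 0.0], [0.0, 1.0]], [1.0, 2.0])]
    value = LinearValue(n_features=2)
    value.mc_fit(logged, gamma=0.5, lr=0.5, epochs=50)
    # A single online transition then adapts the offline estimate.
    value.td_update([1.0, 0.0], 1.0, [0.0, 1.0], gamma=0.5, lr=0.1)
    print(value.w)
```

The offline stage uses full-episode returns, which do not bootstrap but can vary widely between episodes, while the online stage bootstraps from the current estimate so that it can update after every transition during deployment.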