Learning how to reach goals in an environment is a longstanding challenge in AI, yet reasoning over long horizons remains difficult for modern methods. The key question is how to estimate the temporal distance between pairs of observations. While temporal difference methods leverage local updates to provide optimality guarantees, they often perform worse than Monte Carlo methods, which perform global updates (e.g., with multi-step returns) but lack such guarantees. We show how these approaches can be integrated into a practical offline GCRL method that fits a quasimetric distance using a multistep Monte Carlo return. Our method outperforms existing offline GCRL methods on long-horizon simulated tasks with up to 4000 steps, even with visual observations. We also demonstrate that our method enables stitching in a real-world robotic manipulation domain (Bridge setup). Our approach is the first end-to-end offline GCRL method that enables multistep stitching in this real-world manipulation domain from an unlabeled offline dataset of visual observations, and it demonstrates robust horizon generalization.
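The core idea of fitting a quasimetric distance to Monte Carlo temporal-distance targets can be sketched in a toy form. The parameterization below (`d(s, g) = sum_i relu(h(g)_i - h(s)_i)` with a linear embedding `h`) is an illustrative assumption, not the paper's actual architecture: it satisfies `d(x, x) = 0` and the triangle inequality while being asymmetric in general, so it is a valid quasimetric. The supervision is the multistep Monte Carlo target: along a trajectory, the temporal distance from `s_t` to `s_{t+k}` is `k` steps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quasimetric parameterization (an illustrative assumption):
# d(s, g) = sum_i relu((W @ g)_i - (W @ s)_i), a linear embedding followed
# by a one-sided gap. This gives d(x, x) = 0 and the triangle inequality,
# but d(s, g) != d(g, s) in general -- a quasimetric.
W = rng.normal(size=(4, 2)) * 0.1

def dist(s, g):
    return np.maximum(W @ g - W @ s, 0.0).sum()

# Toy trajectory: states are (position, bias) pairs along a chain.
states = [np.array([float(t), 1.0]) for t in range(10)]

# Monte Carlo supervision: the temporal distance between s_t and s_{t+k}
# observed on the trajectory is simply k (a multistep return target).
pairs = [(a, a + k) for a in range(10) for k in range(1, 6) if a + k < 10]

def mc_loss():
    return float(np.mean([(dist(states[a], states[b]) - (b - a)) ** 2
                          for a, b in pairs]))

init_loss, lr = mc_loss(), 1e-3
for _ in range(3000):
    a, b = pairs[rng.integers(len(pairs))]
    s, g = states[a], states[b]
    diff = W @ g - W @ s                      # per-dimension embedding gap
    err = np.maximum(diff, 0.0).sum() - (b - a)
    active = (diff > 0).astype(float)         # relu subgradient mask
    W -= lr * 2.0 * err * np.outer(active, g - s)  # squared-error step
```

After training, the fitted distance regresses onto the Monte Carlo targets (e.g., `dist(states[0], states[5])` approaches 5), while `dist(x, x)` remains exactly zero by construction. The real method additionally combines this with temporal-difference-style updates to recover stitching; this sketch shows only the Monte Carlo regression component.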