Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motivation, maximizing a composite objective of extrinsic and intrinsic rewards. We suggest that this approach incurs unnecessary overhead: while policy optimization is necessary for precise task execution, employing such machinery solely to expand state coverage may be inefficient. In this paper, we propose a new paradigm that explicitly separates exploration from exploitation and bypasses RL during the exploration phase. Our method uses a tree-search strategy inspired by the Go-With-The-Winners algorithm, paired with a measure of epistemic uncertainty to systematically drive exploration. By removing the overhead of policy optimization, our approach explores an order of magnitude more efficiently than standard intrinsic motivation baselines on hard Atari benchmarks. Further, we demonstrate that the discovered trajectories can be distilled into deployable policies using existing supervised backward learning algorithms, achieving state-of-the-art scores by a wide margin on Montezuma's Revenge, Pitfall!, and Venture without relying on domain-specific knowledge. Finally, we demonstrate the generality of our framework in high-dimensional continuous action spaces by solving the MuJoCo Adroit dexterous manipulation and AntMaze tasks in a sparse-reward setting, directly from image observations and without expert demonstrations or offline datasets. To the best of our knowledge, this has not been achieved before for the Adroit tasks.
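To make the exploration idea concrete, the following is a minimal sketch of a Go-With-The-Winners-style particle search on a toy problem. It is not the method evaluated in this paper. The chain environment (`step`), the count-based novelty score standing in for epistemic uncertainty (`uncertainty`), and all parameter names and defaults are assumptions chosen for illustration. No policy is trained: particles take random actions, the least promising half is discarded each round, and the most uncertain half is cloned.

```python
import math
import random
from collections import Counter


def step(state, action):
    """Hypothetical toy chain: action 1 moves right, action 0 falls back two cells."""
    return state + 1 if action == 1 else max(0, state - 2)


def uncertainty(state, visits):
    """Count-based novelty, used here as a stand-in for an epistemic uncertainty estimate."""
    return 1.0 / math.sqrt(visits[state])


def replay(history, start=0):
    """Re-execute a stored action sequence to recover the state it leads to."""
    state = start
    for action in history:
        state = step(state, action)
    return state


def explore(num_particles=16, rounds=30, steps_per_round=4, seed=0):
    rng = random.Random(seed)
    visits = Counter({0: 1})
    # Each particle is (state, action history); the history is the discovered trajectory.
    particles = [(0, []) for _ in range(num_particles)]
    for _ in range(rounds):
        advanced = []
        for state, history in particles:
            history = list(history)
            for _ in range(steps_per_round):
                action = rng.randrange(2)  # random actions only, no learned policy
                state = step(state, action)
                history.append(action)
                visits[state] += 1
            advanced.append((state, history))
        # Keep the half of the population that ended in the most uncertain states ...
        advanced.sort(key=lambda p: uncertainty(p[0], visits), reverse=True)
        survivors = advanced[: num_particles // 2]
        # ... and clone the survivors to restore the population size.
        particles = survivors + [(s, list(h)) for s, h in survivors]
    best = max(particles, key=lambda p: p[0])
    return best, visits


(best_state, best_history), visits = explore()
print("furthest surviving state:", best_state)
```

Because every particle stores its full action history, the best trajectory can be replayed from the start state, which is the kind of artifact a separate distillation stage could consume. A learned uncertainty estimate would replace the visit counter in any realistic setting.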
