Reinforcement learning in sparse-reward navigation environments is challenging when interactions are expensive and limited, which makes effective exploration essential. Motivated by complex navigation tasks that require real-world training (because cheap simulators are not available), we consider an agent that faces an unknown distribution of environments and must decide on an exploration strategy. The agent may leverage a series of training environments to improve its policy before it is evaluated in a test environment drawn from the same distribution. Most existing approaches focus on fixed exploration strategies, while the few that view exploration as a meta-optimization problem tend to ignore the need for cost-efficient exploration. We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies. The algorithm adjusts three levers (the locations of the subgoals, the length of each episode, and the number of replications per trial) to overcome the challenges of sparse rewards, expensive interactions, and noise. An experimental evaluation demonstrates that the new approach outperforms existing baselines across a number of problem domains. We also provide a theoretical foundation and prove that the method asymptotically identifies a near-optimal subgoal design.
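To make the three levers concrete, the following is a minimal sketch of a cost-aware selection step. It is not the paper's algorithm: the `Design` container, the cost model (episode length times replications), and the value-per-cost scoring rule are all assumptions chosen for illustration, and `toy_value` is a hypothetical stand-in for whatever estimate a Bayesian optimization surrogate would supply.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Design:
    """Hypothetical container for one candidate exploration design."""
    subgoals: Tuple[Tuple[int, int], ...]  # assumed grid coordinates of subgoals
    episode_length: int                    # maximum steps per episode
    replications: int                      # repeated trials to average out noise


def interaction_cost(design: Design) -> int:
    # Assumed cost model: total environment steps consumed by one evaluation.
    return design.episode_length * design.replications


def pick_next(candidates: Sequence[Design],
              value_estimate: Callable[[Design], float]) -> Design:
    # Generic cost-aware heuristic: maximize estimated value per unit cost.
    return max(candidates, key=lambda d: value_estimate(d) / interaction_cost(d))


# Toy usage with made-up numbers (not results from the paper).
cheap = Design(subgoals=((1, 1),), episode_length=10, replications=2)
costly = Design(subgoals=((2, 2),), episode_length=50, replications=4)


def toy_value(design: Design) -> float:
    # Hypothetical stand-in for a surrogate-based estimate of improvement.
    return 5.0 if design is costly else 1.0


chosen = pick_next([cheap, costly], toy_value)
```

In this toy setting the costly design has the higher estimated value but ten times the interaction cost, so the value-per-cost rule selects the cheap design. A real cost-aware method would also have to account for uncertainty in the value estimate.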