Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards; optimistic exploration has emerged as a promising direction aligned with this idea, enabling sample-efficient reinforcement learning. However, existing methods overlook a crucial aspect: the need for optimism to be informed by a belief connecting the reward and state. To address this, we propose a practical, theoretically grounded approach to optimistic exploration based on Thompson sampling. Our model structure is the first that allows for reasoning about joint uncertainty over transitions and rewards. We evaluate our method on a set of MuJoCo and VMAS continuous control tasks. Our experiments demonstrate that optimistic exploration significantly accelerates learning in environments with sparse rewards, action penalties, and difficult-to-explore regions. Furthermore, we provide insights into when optimism is beneficial and emphasize the critical role of model uncertainty in guiding exploration.