We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper-bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but this over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values, and we explicitly characterize the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments on difficult exploration tasks, in both tabular and continuous control settings, show that our sharper uncertainty estimates improve sample efficiency.
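To make the target quantity concrete, the following is a minimal sketch of a brute-force Monte Carlo estimate of the variance over values induced by a distribution over MDPs. It is not the uncertainty Bellman equation described above. It assumes a small tabular MDP, a fixed policy already folded into the transition counts, known rewards, and an independent Dirichlet posterior per state with a uniform prior. All function names and parameters are hypothetical and chosen for illustration only.

```python
import random
import statistics


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def evaluate_policy(P, r, gamma, iters=500):
    """Iterative policy evaluation: repeatedly apply V <- r + gamma * P V."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def posterior_value_variance(counts, r, gamma=0.9, n_samples=200, seed=0):
    """Estimate the per-state posterior mean and variance of the value function.

    counts[s][t] is the observed number of s -> t transitions under the fixed
    policy. Adding 1.0 to each count is an assumed uniform Dirichlet prior.
    """
    rng = random.Random(seed)
    n = len(r)
    value_samples = []
    for _ in range(n_samples):
        # Sample one plausible transition model from the posterior.
        P = [sample_dirichlet([c + 1.0 for c in counts[s]], rng) for s in range(n)]
        # Solve for the value function of that sampled model.
        value_samples.append(evaluate_policy(P, r, gamma))
    mean = [statistics.fmean([v[s] for v in value_samples]) for s in range(n)]
    var = [statistics.pvariance([v[s] for v in value_samples]) for s in range(n)]
    return mean, var


if __name__ == "__main__":
    counts = [[5, 1, 0], [0, 5, 1], [1, 0, 5]]  # illustrative counts, not real data
    rewards = [0.0, 0.0, 1.0]
    mean, var = posterior_value_variance(counts, rewards)
    print("posterior mean of V:", mean)
    print("posterior variance of V:", var)
```

Sampling of this kind needs one policy-evaluation solve per sampled model, which is the cost that an uncertainty Bellman equation avoids by propagating uncertainty through a single Bellman-style recursion.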