We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over Markov decision processes (MDPs). Previous work upper-bounds the posterior variance over values by solving a so-called uncertainty Bellman equation, but the over-approximation may result in inefficient exploration. We propose a new uncertainty Bellman equation whose solution converges to the true posterior variance over values and explicitly characterizes the gap in previous work. Moreover, our uncertainty quantification technique is easily integrated into common exploration strategies and scales naturally beyond the tabular setting by using standard deep reinforcement learning architectures. Experiments in difficult exploration tasks, in both tabular and continuous control settings, show that our sharper uncertainty estimates improve sample efficiency.
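To make the target quantity concrete, the following is a minimal sketch of the variance over values induced by a distribution over MDPs, estimated by brute-force Monte Carlo in a small tabular problem. It is not the uncertainty Bellman equation proposed here, only a reference estimate of the quantity such an equation aims to recover. The Dirichlet posterior over transitions, the known reward vector, the fixed policy folded into the transition matrix, and all names (`policy_values`, `value_posterior_moments`, `counts`) are assumptions made for illustration.

```python
import numpy as np


def policy_values(P, r, gamma):
    """Solve V = r + gamma * P V for a fixed policy.

    P: [S, S] state-to-state transition matrix under the policy.
    r: [S] expected one-step reward under the policy.
    """
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P, r)


def value_posterior_moments(counts, r, gamma, n_samples=1000, seed=0):
    """Monte Carlo estimate of the posterior mean and variance of V.

    Assumption: each row of the transition matrix has an independent
    Dirichlet posterior with concentration parameters counts[s].
    """
    rng = np.random.default_rng(seed)
    n_states = counts.shape[0]
    values = np.empty((n_samples, n_states))
    for i in range(n_samples):
        # Draw one MDP from the posterior: one Dirichlet sample per state.
        P = np.vstack([rng.dirichlet(row) for row in counts])
        # Evaluate the policy exactly in the sampled MDP.
        values[i] = policy_values(P, r, gamma)
    # Variance across sampled MDPs is the posterior variance over values.
    return values.mean(axis=0), values.var(axis=0)


# Hypothetical 3-state example: Dirichlet counts favouring self-transitions.
counts = np.array([[5.0, 1.0, 1.0],
                   [1.0, 5.0, 1.0],
                   [1.0, 1.0, 5.0]])
r = np.array([0.0, 0.0, 1.0])
mean_v, var_v = value_posterior_moments(counts, r, gamma=0.9, n_samples=500)
```

Sampling and solving many MDPs in this way is only practical for small tabular problems, which is one reason a Bellman-style recursion for the variance is attractive.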