We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning. In particular, we focus on characterizing the variance over values induced by a distribution over MDPs. Previous work upper-bounds the posterior variance over values by solving a so-called uncertainty Bellman equation (UBE), but the over-approximation may result in inefficient exploration. We propose a new UBE whose solution converges to the true posterior variance over values and leads to lower regret in tabular exploration problems. We identify challenges in applying the UBE theory beyond tabular problems and propose a suitable approximation. Based on this approximation, we introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization with minimal changes. Experiments in both online and offline RL demonstrate improved performance compared to other uncertainty estimation methods.
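To make the quantity of interest concrete, the following sketch estimates the variance over values induced by a distribution over MDPs by brute-force sampling. It is not the UBE proposed here, only a Monte Carlo reference under stated assumptions: a fixed policy (so each sampled MDP reduces to a Markov reward process), known rewards, and an independent Dirichlet posterior over each row of the transition matrix. All function names are hypothetical and chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def evaluate_policy(P, r, gamma, iters=200):
    """Iterate V <- r + gamma * P V for a fixed policy (assumed already folded into P and r)."""
    n = len(r)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s] + gamma * sum(P[s][t] * V[t] for t in range(n)) for s in range(n)]
    return V


def posterior_value_stats(counts, r, gamma, n_samples=500, seed=0):
    """Per-state empirical mean and variance of V across transition models sampled from the posterior.

    counts[s] holds the Dirichlet parameters for the next-state distribution of
    state s (an assumed posterior form, not taken from the text above).
    """
    rng = random.Random(seed)
    n = len(r)
    samples = []
    for _ in range(n_samples):
        P = [sample_dirichlet(row, rng) for row in counts]
        samples.append(evaluate_policy(P, r, gamma))
    mean = [sum(v[s] for v in samples) / n_samples for s in range(n)]
    var = [sum((v[s] - mean[s]) ** 2 for v in samples) / n_samples for s in range(n)]
    return mean, var


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]
    weak_posterior = [[1.0, 1.0, 1.0] for _ in range(3)]         # little data
    strong_posterior = [[100.0, 100.0, 100.0] for _ in range(3)]  # much more data
    print(posterior_value_stats(weak_posterior, rewards, gamma=0.9))
    print(posterior_value_stats(strong_posterior, rewards, gamma=0.9))
```

Sampling-based estimates like this one require a full policy evaluation per sampled model, which is the cost that a Bellman-style recursion for the variance is meant to avoid. The sketch is useful mainly as a ground-truth check in small tabular problems.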