How to explore efficiently in reinforcement learning is an open problem. Many exploration algorithms employ the epistemic uncertainty of their own value predictions, for instance to compute an exploration bonus or an upper confidence bound. Unfortunately, the required uncertainty is difficult to estimate in general with function approximation. We propose epistemic value estimation (EVE): a recipe that is compatible with sequential decision making and with neural network function approximators. It equips agents with a tractable posterior over all their parameters, from which epistemic value uncertainty can be computed efficiently. We use the recipe to derive an epistemic Q-Learning agent and observe competitive performance on a series of benchmarks. Experiments confirm that the EVE recipe facilitates efficient exploration in hard exploration tasks.
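To make the role of epistemic value uncertainty concrete, the following is a minimal sketch of how an upper confidence bound could be computed from samples of a parameter posterior. It is not the EVE recipe itself: it assumes a linear Q-function and a diagonal Gaussian posterior over its parameters, and every name in it (`sample_posterior`, `q_values`, `ucb_action`, `beta`) is hypothetical and chosen only for illustration.

```python
import random
import statistics


def sample_posterior(mean, std, n, rng):
    """Draw n parameter vectors from an assumed diagonal Gaussian posterior."""
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(n)]


def q_values(theta, features):
    """Linear Q-value of each action: Q(a) = theta . features[a] (an assumption)."""
    return [sum(t * f for t, f in zip(theta, phi)) for phi in features]


def ucb_action(posterior_samples, features, beta=1.0):
    """Return the action maximizing mean + beta * std of Q over parameter samples."""
    per_sample = [q_values(theta, features) for theta in posterior_samples]
    scores = []
    for a in range(len(features)):
        qs = [row[a] for row in per_sample]
        # The spread of Q across parameter samples stands in for epistemic uncertainty.
        scores.append(statistics.fmean(qs) + beta * statistics.pstdev(qs))
    best = max(range(len(features)), key=scores.__getitem__)
    return best, scores


# Two actions with one-hot features: action 0 is well known, action 1 is uncertain.
rng = random.Random(0)
features = [[1.0, 0.0], [0.0, 1.0]]
samples = sample_posterior(mean=[1.0, 0.5], std=[0.01, 1.0], n=2000, rng=rng)

greedy_action, _ = ucb_action(samples, features, beta=0.0)
optimistic_action, _ = ucb_action(samples, features, beta=2.0)
```

In this toy setting, setting `beta` to zero ignores uncertainty and favors the action with the higher mean value, whereas a positive `beta` adds a bonus proportional to the posterior spread and so favors the less well-known action.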