Policy evaluation is an important instrument for comparing different algorithms in Reinforcement Learning (RL). However, even precise knowledge of the value function $V^{\pi}$ corresponding to a policy $\pi$ does not provide reliable information on how far the policy $\pi$ is from the optimal one. We present a novel model-free upper value iteration procedure ({\sf UVIP}) that allows us to estimate the suboptimality gap $V^{\star}(x) - V^{\pi}(x)$ from above and to construct confidence intervals for $V^{\star}$. Our approach relies on upper bounds on the solution of the Bellman optimality equation derived via the martingale approach. We provide theoretical guarantees for {\sf UVIP} under general assumptions and illustrate its performance on a number of benchmark RL problems.