Policy evaluation is an important instrument for comparing different algorithms in Reinforcement Learning (RL). Yet even precise knowledge of the value function $V^{\pi}$ corresponding to a policy $\pi$ does not provide reliable information on how far the policy $\pi$ is from the optimal one. We present a novel model-free upper value iteration procedure $({\sf UVIP})$ that allows us to estimate the suboptimality gap $V^{\star}(x) - V^{\pi}(x)$ from above and to construct confidence intervals for $V^\star$. Our approach relies on upper bounds on the solution of the Bellman optimality equation, obtained via a martingale approach. We provide theoretical guarantees for ${\sf UVIP}$ under general assumptions and illustrate its performance on a number of benchmark RL problems.
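To make the martingale mechanism concrete, the following display sketches the standard one-step argument on which such bounds rest (our illustration in the abstract's notation; the exact ${\sf UVIP}$ operator is defined in the body of the paper). For next-state samples $Y^{x,a}\sim P(\cdot\mid x,a)$ and any bounded function $h$, the penalty $h(Y^{x,a})-\mathbb{E}[h(Y^{x,a})]$ has zero mean for each fixed action, so subtracting it leaves every per-action expectation unchanged, while interchanging maximum and expectation can only increase the value:
\[
V^{\star}(x)=\max_{a}\Big\{r(x,a)+\gamma\,\mathbb{E}\big[V^{\star}(Y^{x,a})\big]\Big\}
\;\le\;
\mathbb{E}\Big[\max_{a}\Big\{r(x,a)+\gamma V^{\star}(Y^{x,a})-\gamma\big(h(Y^{x,a})-\mathbb{E}[h(Y^{x,a})]\big)\Big\}\Big].
\]
Choosing $h$ close to $V^{\star}$ tightens the bound: with $h=V^{\star}$ the bracket no longer depends on the realization of $Y^{x,a}$ and the inequality becomes an equality, while $h=V^{\pi}$ is a natural computable surrogate.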
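As a complement, here is a minimal Python sketch of a UVIP-style fixed-point iteration on a tabular MDP, assuming only sampling access to the transition kernel (the model-free setting). The helper `sample_next` and the choice $h=V^{\pi}$ are our assumptions for illustration; the paper's actual algorithm and its guarantees are stated in the main text.

```python
import numpy as np

def uvip_style_upper_bound(sample_next, r, gamma, V_pi, n_iter=100, n_mc=64):
    """Sketch: iterate a martingale-penalized Bellman backup to bound V* from above.

    sample_next(s, a, size) -- draws `size` next states from P(. | s, a);
                               a hypothetical generative-model helper.
    r                       -- reward array of shape (n_states, n_actions).
    V_pi                    -- value of the evaluated policy, used as penalty h.
    """
    n_states, n_actions = r.shape
    V_up = V_pi.copy()                 # start from the known lower bound V^pi
    for _ in range(n_iter):
        V_new = np.empty(n_states)
        for s in range(n_states):
            vals = np.empty((n_mc, n_actions))
            for a in range(n_actions):
                y = sample_next(s, a, n_mc)     # fresh next-state samples
                y2 = sample_next(s, a, n_mc)    # independent copy to center h
                penalty = V_pi[y] - V_pi[y2].mean()   # zero-mean penalty term
                vals[:, a] = r[s, a] + gamma * (V_up[y] - penalty)
            # interchanging max and expectation: E[max_a {...}] >= max_a E[...]
            V_new[s] = vals.max(axis=1).mean()
        V_up = V_new
    return V_up
```

Together with $V^{\pi}\le V^{\star}\le V^{\mathrm{up}}$ (up to Monte Carlo error), such an iterate yields the pointwise confidence band for $V^{\star}$ and the upper estimate of the suboptimality gap $V^{\star}(x)-V^{\pi}(x)$ described above.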