A common technique in reinforcement learning is to estimate the value function of a given policy from Monte Carlo simulations, and then to obtain a new policy that is greedy with respect to the estimated value function. A well-known, longstanding open problem in this context is to prove the convergence of such a scheme when the value function of a policy is estimated from data collected along a single sample path obtained by implementing the policy (see page 99 of [Sutton and Barto, 2018], page 8 of [Tsitsiklis, 2002]). We present a solution to this open problem by showing that a first-visit version of such a policy iteration scheme indeed converges to the optimal policy, provided that the policy improvement step uses lookahead [Silver et al., 2016, Mnih et al., 2016, Silver et al., 2017b] rather than a simple greedy policy improvement. We provide results for the original open problem in the tabular setting and also present extensions to the function approximation setting, where we show that the policy resulting from the algorithm performs close to the optimal policy, up to a function approximation error.
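As a rough illustration of the kind of scheme described above, the following Python sketch alternates first-visit Monte Carlo evaluation along a single simulated trajectory with an H-step lookahead policy improvement in a tabular MDP. It is a minimal sketch under stated assumptions, not the algorithm analyzed in the paper: the function names, the known transition model `P` and reward table `R`, the truncation parameters `length` and `tail`, the rule of keeping old estimates for unvisited states, and the toy chain MDP are all hypothetical choices made for illustration.

```python
import numpy as np


def lookahead_policy(P, R, V, gamma, H):
    """H-step lookahead policy w.r.t. V (assumes a known model).

    P has shape (A, S, S) with P[a, s, t] = Pr(next = t | s, a);
    R has shape (S, A). H = 1 reduces to the simple greedy policy.
    """
    W = V.copy()
    # Apply the Bellman optimality operator H - 1 times to V.
    for _ in range(H - 1):
        Q = R + gamma * np.einsum("ast,t->sa", P, W)
        W = Q.max(axis=1)
    # One final greedy step with respect to the backed-up values.
    Q = R + gamma * np.einsum("ast,t->sa", P, W)
    return Q.argmax(axis=1)


def first_visit_returns(P, R, policy, gamma, start, length, tail, rng):
    """Simulate one trajectory and return {state: first-visit return}.

    Returns are truncated at the end of the trajectory, so visits in the
    last `tail` steps are ignored to keep the truncation bias small.
    """
    states, rewards = [], []
    s = start
    for _ in range(length):
        a = policy[s]
        states.append(s)
        rewards.append(R[s, a])
        s = rng.choice(P.shape[1], p=P[a, s])
    G, returns = 0.0, {}
    # Backward pass: earlier visits overwrite later ones, so the value
    # left for each state is the return from its first visit.
    for t in range(length - 1, -1, -1):
        G = rewards[t] + gamma * G
        if t < length - tail:
            returns[states[t]] = G
    return returns


def mc_policy_iteration(P, R, gamma=0.9, H=3, iters=50,
                        length=400, tail=100, seed=0):
    """Alternate lookahead improvement and single-path MC evaluation."""
    rng = np.random.default_rng(seed)
    S = R.shape[0]
    V = np.zeros(S)
    for _ in range(iters):
        policy = lookahead_policy(P, R, V, gamma, H)
        start = rng.integers(S)  # assumption: uniformly random start state
        visited = first_visit_returns(P, R, policy, gamma,
                                      start, length, tail, rng)
        # Assumption: unvisited states keep their previous estimates.
        for s, G in visited.items():
            V[s] = G
    return lookahead_policy(P, R, V, gamma, H), V


def toy_chain():
    """Hypothetical 3-state chain: action 0 stays, action 1 moves right."""
    S, A = 3, 2
    P = np.zeros((A, S, S))
    for s in range(S):
        P[0, s, s] = 1.0
        P[1, s, (s + 1) % S] = 1.0
    R = np.zeros((S, A))
    R[2, :] = 1.0  # reward 1 for any action taken in state 2
    return P, R


if __name__ == "__main__":
    P, R = toy_chain()
    policy, V = mc_policy_iteration(P, R)
    print("policy:", policy)
    print("value estimates:", V)
```

In this toy chain the optimal behavior is to move right until reaching state 2 and then stay there, which gives a simple sanity check for the sketch. The lookahead step here relies on a known model; how lookahead is realized in practice (for example, by tree search) is outside the scope of this illustration.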