Variance reduction techniques have been successfully applied to temporal-difference (TD) learning, where they improve the sample complexity of policy evaluation. However, existing work applies variance reduction either to the less popular one time-scale TD algorithm or to the two time-scale GTD algorithm with only a finite number of i.i.d.\ samples, and both algorithms apply only to the on-policy setting. In this work, we develop a variance reduction scheme for the two time-scale TDC algorithm in the off-policy setting and analyze its non-asymptotic convergence rate over both i.i.d.\ and Markovian samples. In the i.i.d.\ setting, the sample complexity of our algorithm matches the best-known lower bound $\tilde{O}(\epsilon^{-1})$. In the Markovian setting, our algorithm achieves the state-of-the-art sample complexity $O(\epsilon^{-1} \log \epsilon^{-1})$, which is near-optimal. Experiments demonstrate that the proposed variance-reduced TDC achieves a smaller asymptotic convergence error than both the conventional TDC and the variance-reduced TD.
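To make the idea of variance reduction for two time-scale TDC concrete, the following is a minimal sketch and not the algorithm analyzed in this work. It assumes linear value-function approximation, one common form of the off-policy TDC update with importance weight `rho`, and an SVRG-style control variate in which each batch first fixes a reference point and a batch-mean direction. The function names, the sample layout `(phi, phi_next, reward, rho)`, and the choice of reference point are all hypothetical and introduced only for illustration.

```python
import numpy as np


def tdc_directions(theta, w, phi, phi_next, reward, rho, gamma):
    """Per-sample TDC update directions for the two iterates (theta, w).

    Assumes linear features; rho is the importance-sampling ratio.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta  # TD error
    g_theta = rho * (delta * phi - gamma * phi_next * (phi @ w))
    g_w = rho * delta * phi - phi * (phi @ w)
    return g_theta, g_w


def batch_mean_directions(theta, w, batch, gamma):
    """Average the per-sample directions over a batch at a fixed point."""
    dirs = [tdc_directions(theta, w, *sample, gamma) for sample in batch]
    mean_theta = np.mean([d[0] for d in dirs], axis=0)
    mean_w = np.mean([d[1] for d in dirs], axis=0)
    return mean_theta, mean_w


def vr_tdc(batches, dim, alpha, beta, gamma):
    """SVRG-style variance-reduced TDC sketch (illustrative assumption).

    batches: iterable of lists of (phi, phi_next, reward, rho) tuples.
    alpha, beta: step sizes for theta and w (the two time scales).
    """
    theta = np.zeros(dim)
    w = np.zeros(dim)
    for batch in batches:
        # Fix a reference point and its batch-mean direction for this epoch.
        theta_ref, w_ref = theta.copy(), w.copy()
        mean_theta, mean_w = batch_mean_directions(theta_ref, w_ref, batch, gamma)
        for sample in batch:
            g_theta, g_w = tdc_directions(theta, w, *sample, gamma)
            r_theta, r_w = tdc_directions(theta_ref, w_ref, *sample, gamma)
            # Control variate: subtract the direction at the reference point
            # and add back the batch mean at that point.
            theta = theta + alpha * (g_theta - r_theta + mean_theta)
            w = w + beta * (g_w - r_w + mean_w)
    return theta, w
```

Calling `vr_tdc` with a list of batches, a feature dimension, two step sizes, and a discount factor returns the final pair `(theta, w)`. At the reference point the corrected direction equals the batch mean, which is the property that makes the correction reduce variance without changing the expected direction under i.i.d.\ sampling.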