Off-policy learning enables a reinforcement learning (RL) agent to reason counterfactually about policies that are not executed, and it is one of the most important ideas in RL. However, it can lead to instability when combined with function approximation and bootstrapping, two arguably indispensable ingredients for large-scale reinforcement learning. This is the notorious deadly triad. Gradient Temporal Difference (GTD) learning is one powerful tool for addressing the deadly triad. Its success results from solving a double sampling issue indirectly, with either weight duplication or Fenchel duality. In this paper, we instead propose a direct method that solves the double sampling issue by simply using two samples from a Markovian data stream, separated by an increasing gap. The resulting algorithm is as computationally efficient as GTD but gets rid of GTD's extra weights. The only price we pay is memory that grows logarithmically as time progresses. We provide both asymptotic and finite-sample analyses, where the convergence rate is on par with that of canonical on-policy temporal difference learning. Key to our analysis is a novel refined discretization of limiting ODEs.
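To make the two-sample idea concrete, here is a minimal sketch under stated assumptions, not the paper's exact algorithm. The abstract does not give the update rule, so the sketch assumes linear value estimates and the objective 0.5 * ||E[rho * delta * x]||^2, whose gradient is a product of two expectations, E[rho * (gamma * x' - x) * x^T] and E[rho * delta * x]. Each factor is estimated from a different transition in the stream. The gap schedule `gap`, the helpers `dot`, `two_sample_update`, and `run`, and the tuple layout `(x, r, x_next, rho)` are all hypothetical names introduced for illustration.

```python
import math
from collections import deque


def gap(t):
    # Hypothetical slowly increasing gap schedule (an assumption, not from the abstract).
    return int(math.log(t + 1) ** 2) + 1


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def two_sample_update(w, old, new, alpha, gamma):
    """One step using two separated transitions, each given as (x, r, x_next, rho)."""
    x0, r0, x0_next, rho0 = old
    x1, _r1, x1_next, rho1 = new
    # The older transition estimates E[rho * delta * x].
    delta0 = r0 + gamma * dot(x0_next, w) - dot(x0, w)
    # The newer transition estimates E[rho * (gamma * x' - x) * x^T].
    scale = rho1 * rho0 * delta0 * dot(x1, x0)
    return [wi - alpha * scale * (gamma * xn - x)
            for wi, xn, x in zip(w, x1_next, x1)]


def run(transitions, dim, alpha=0.1, gamma=0.5):
    w = [0.0] * dim
    buffer = deque()  # holds only the most recent gap(t) + 1 transitions
    for t, transition in enumerate(transitions):
        buffer.append(transition)
        g = gap(t)
        while len(buffer) > g + 1:
            buffer.popleft()
        if len(buffer) == g + 1:
            # Pair the newest transition with the one g steps earlier.
            w = two_sample_update(w, buffer[0], buffer[-1], alpha, gamma)
    return w
```

In this sketch the only state beyond the single weight vector `w` is the buffer, whose length tracks the gap; that is the memory cost the abstract refers to, while no second set of weights is maintained. The step is skipped on the rare iterations where the assumed schedule jumps by more than one and the buffer has not yet caught up.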