Learning from multi-step off-policy data collected by a set of policies is a core problem of reinforcement learning (RL). Approaches based on importance sampling (IS) often suffer from large variances due to products of IS ratios. Typical IS-free methods, such as $n$-step Q-learning, look ahead for $n$ time steps along the trajectory of actions (where $n$ is called the lookahead depth) and utilize off-policy data directly without any additional adjustment. They work well for proper choices of $n$. We show, however, that such IS-free methods underestimate the optimal value function (VF), especially for large $n$, restricting their capacity to efficiently utilize information from distant future time steps. To overcome this problem, we introduce a novel, IS-free, multi-step off-policy method that avoids the underestimation issue and converges to the optimal VF. At its core lies a simple but non-trivial \emph{highway gate}, which controls the information flow from the distant future by comparing it to a threshold. The highway gate guarantees convergence to the optimal VF for arbitrary $n$ and arbitrary behavioral policies. It gives rise to a novel family of off-policy RL algorithms that safely learn even when $n$ is very large, facilitating rapid credit assignment from the far future to the past. On tasks with greatly delayed rewards, including video games where the reward is given only at the end of the game, our new methods outperform many existing multi-step off-policy algorithms.
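To make the contrast concrete, the following is a minimal sketch under stated assumptions, not the algorithm proposed here. `n_step_target` computes the standard uncorrected $n$-step return, $\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_a Q(s_{t+n}, a)$, where the bootstrap term is passed in by the caller. `gated_target` is a hypothetical illustration of the gating idea: it assumes the threshold is a one-step target supplied by the caller and lets the $n$-step return through only when it exceeds that threshold. The function names, parameter names, and the choice of threshold are assumptions made for illustration; the abstract does not specify them.

```python
from typing import Sequence


def n_step_target(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Uncorrected n-step return.

    rewards         -- r_t, ..., r_{t+n-1} observed along the behavioral trajectory
    bootstrap_value -- max_a Q(s_{t+n}, a), supplied by the caller
    gamma           -- discount factor
    """
    target = bootstrap_value
    # Fold the rewards in from the last step back to the first.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


def gated_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    threshold: float,
    gamma: float,
) -> float:
    """Hypothetical gate (an assumption, not the paper's definition).

    The n-step return is used only if it exceeds `threshold`
    (for example, a one-step target); otherwise the threshold is kept.
    """
    n_step = n_step_target(rewards, bootstrap_value, gamma)
    return n_step if n_step > threshold else threshold
```

In this sketch, a trajectory produced by a poor behavioral policy yields a low $n$-step return, so the plain $n$-step target would pull the estimate down, whereas the gated variant falls back to the threshold. A trajectory that reaches a delayed reward yields a high $n$-step return, which passes the gate and propagates that reward back in a single update.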