Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, but counteracting off-policy bias without exacerbating variance is challenging. Classically, off-policy bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio after each action via eligibility traces. Many off-policy algorithms rely on this mechanism, along with differing protocols for cutting the IS ratios to combat the variance of the IS estimator. Unfortunately, once a trace has been fully cut, the effect cannot be reversed. This has led to the development of credit-assignment strategies that account for multiple past experiences at a time. These trajectory-aware methods have not been extensively analyzed, and their theoretical justification remains uncertain. In this paper, we propose a multistep operator that can express both per-decision and trajectory-aware methods. We prove convergence conditions for our operator in the tabular setting, establishing the first guarantees for several existing methods as well as many new ones. Finally, we introduce Recency-Bounded Importance Sampling (RBIS), which leverages trajectory awareness to perform robustly across $\lambda$-values in an off-policy control task.
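As a minimal sketch of the per-decision mechanism described above, the following Python snippet builds trace coefficients as a running product of cut IS ratios and uses them to weight past temporal-difference errors. The cutting rule $\lambda \min(1, \rho_t)$ is a Retrace-style choice assumed here for illustration only, and the function names `per_decision_coefficients` and `corrected_update` are hypothetical; none of this is the operator or the RBIS method proposed in the paper.

```python
import numpy as np

def per_decision_coefficients(rhos, lam=0.9):
    """Return trace coefficients beta_0..beta_T for IS ratios rho_1..rho_T.

    beta_0 = 1 and beta_t = beta_{t-1} * lam * min(1, rho_t).
    The min(1, rho) cut is an assumed, Retrace-style clipping rule.
    """
    cuts = lam * np.minimum(1.0, np.asarray(rhos, dtype=float))
    return np.concatenate(([1.0], np.cumprod(cuts)))

def corrected_update(td_errors, rhos, gamma=0.99, lam=0.9):
    """Sum of discounted TD errors, each weighted by its trace coefficient.

    td_errors must have one more entry than rhos (delta_0..delta_T).
    """
    betas = per_decision_coefficients(rhos, lam)
    discounts = gamma ** np.arange(len(betas))
    return float(np.sum(discounts * betas * np.asarray(td_errors, dtype=float)))

# The second ratio is 0, so the trace is fully cut at that step.
rhos = [1.5, 0.0, 4.0, 2.0]
betas = per_decision_coefficients(rhos)
print(betas)
```

Because each coefficient is a product of all earlier cuts, the zero ratio at the second step forces every later coefficient to zero, even though the subsequent ratios are large. This illustrates why a fully cut per-decision trace cannot be reversed, which is the limitation that motivates trajectory-aware methods.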