Temporal difference (TD) learning is a foundational concept in reinforcement learning (RL), aimed at efficiently estimating the value function of a policy. TD($\lambda$), a widely used variant, maintains a memory (eligibility) trace that propagates the prediction error back to previously visited states. However, this approach does not account for how significant each historical state is or for how important it is to propagate the TD error to it, both of which are affected by challenges such as visitation imbalance or outcome noise. To address this, we propose a novel TD algorithm named discerning TD learning (DTD), which allows flexible emphasis functions, either predetermined or adapted during training, to allocate learning effort effectively across states. We establish the convergence properties of our method within a specific class of emphasis functions and show that it can be adapted to deep RL settings. Empirical results indicate that a judiciously chosen emphasis function not only improves value estimation but also speeds up learning across diverse scenarios.
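To make the trace mechanism concrete, the following is a minimal sketch of tabular TD($\lambda$) with accumulating traces, in which a per-state weight scales how much each visited state contributes to the trace. This weighting is only one hypothetical way to realize an emphasis function and is not claimed to be the DTD update itself; the names `td_lambda`, `emphasis`, and `random_walk_episode` are illustrative assumptions, as is the random-walk environment.

```python
import random


def td_lambda(episodes, n_states, alpha=0.1, gamma=1.0, lam=0.8, emphasis=None):
    """Tabular TD(lambda) policy evaluation with accumulating traces.

    episodes: iterable of episodes; each episode is a list of
        (state, reward, next_state) tuples, with next_state=None at termination.
    emphasis: optional function state -> non-negative weight. This is a
        hypothetical illustration of an emphasis function, not the DTD rule.
        With the default (constant 1.0) this is ordinary TD(lambda).
    """
    if emphasis is None:
        emphasis = lambda s: 1.0
    v = [0.0] * n_states
    for episode in episodes:
        z = [0.0] * n_states  # trace is reset at the start of each episode
        for s, r, s_next in episode:
            v_next = 0.0 if s_next is None else v[s_next]
            delta = r + gamma * v_next - v[s]  # one-step TD error
            # Decay all traces, then credit the current state by its weight.
            for i in range(n_states):
                z[i] *= gamma * lam
            z[s] += emphasis(s)
            # Propagate the TD error to every state in proportion to its trace.
            for i in range(n_states):
                v[i] += alpha * delta * z[i]
    return v


def random_walk_episode(n_states, rng):
    """One episode of a symmetric random walk on a chain (assumed toy task).

    Reward is 1.0 when exiting on the right, 0.0 otherwise.
    """
    s = n_states // 2
    steps = []
    while True:
        s_next = s + rng.choice((-1, 1))
        if s_next < 0:
            steps.append((s, 0.0, None))
            return steps
        if s_next >= n_states:
            steps.append((s, 1.0, None))
            return steps
        steps.append((s, 0.0, s_next))
        s = s_next


if __name__ == "__main__":
    rng = random.Random(0)
    episodes = [random_walk_episode(5, rng) for _ in range(500)]
    # Uniform emphasis versus down-weighting the two edge states.
    print(td_lambda(episodes, 5))
    print(td_lambda(episodes, 5, emphasis=lambda s: 0.5 if s in (0, 4) else 1.0))
```

Setting the weight of a state to zero in this sketch means that state never enters the trace, so no TD error is propagated to it; larger weights amplify its updates. Whether such a weighting helps depends on the task and on the choice of emphasis function.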