Gradient temporal-difference (GTD) learning algorithms are widely used for off-policy policy evaluation with function approximation. However, existing convergence analyses rely on the restrictive assumption that the so-called feature interaction matrix (FIM) is nonsingular. In practice, the FIM can become singular, leading to instability or degraded performance. In this paper, we reformulate mean-square projected Bellman error (MSPBE) minimization as a regularized optimization objective. This formulation naturally yields a regularized GTD algorithm, referred to as R-GTD, which is guaranteed to converge to a unique solution even when the FIM is singular. We establish theoretical convergence guarantees and explicit error bounds for the proposed method, and validate its effectiveness through empirical experiments.
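The paper's exact regularized objective is given in the body; the following is a minimal sketch of the construction the abstract alludes to, written in standard GTD notation (an assumption, not the paper's verbatim definitions). With linear features $\phi$, discount $\gamma$, and reward $r$, let $A = \mathbb{E}\!\left[\phi(\phi - \gamma\phi')^{\top}\right]$, $b = \mathbb{E}[r\phi]$, and let the FIM be $C = \mathbb{E}\!\left[\phi\phi^{\top}\right]$. The standard objective
\[
\mathrm{MSPBE}(\theta) = (A\theta - b)^{\top} C^{-1} (A\theta - b)
\]
is well-defined only when $C$ is nonsingular. One common regularization, with an assumed parameter $\eta > 0$ (not necessarily the paper's choice), replaces $C^{-1}$ by $(C + \eta I)^{-1}$, so that the objective remains well-defined even for singular $C$:
\[
\mathrm{MSPBE}_{\eta}(\theta) = (A\theta - b)^{\top} (C + \eta I)^{-1} (A\theta - b).
\]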