Q-learning and SARSA are foundational reinforcement learning algorithms whose practical success depends critically on step-size calibration. Step-sizes that are too large can cause numerical instability, while step-sizes that are too small can lead to slow progress. We propose implicit variants of Q-learning and SARSA that reformulate their iterative updates as fixed-point equations. This yields an adaptive step-size adjustment that scales inversely with feature norms, providing automatic regularization without manual tuning. Our non-asymptotic analyses demonstrate that the implicit methods maintain stability over significantly broader step-size ranges. Under favorable conditions, they permit arbitrarily large step-sizes while achieving convergence rates comparable to those of the standard methods. Empirical validation across benchmark environments spanning discrete and continuous state spaces shows that implicit Q-learning and SARSA exhibit substantially reduced sensitivity to step-size selection, achieving stable performance with step-sizes that would cause standard methods to fail.
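To make the fixed-point reformulation concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes linear function approximation, Q(s, a) = phi(s, a)^T theta, and assumes that the implicit update evaluates the current state-action value at the new parameters: theta_new = theta + alpha * (target - phi^T theta_new) * phi. Solving this equation for theta_new gives an ordinary update with effective step-size alpha / (1 + alpha * ||phi||^2), which shrinks as the feature norm grows. The function names, the toy transition, and the choice to hold the bootstrapped target at the old parameters are all illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def td_target(theta, reward, next_phis, gamma):
    """Q-learning target r + gamma * max_a' phi(s', a')^T theta.

    An empty next_phis marks a terminal next state (bootstrap value 0).
    """
    return reward + gamma * max((dot(p, theta) for p in next_phis), default=0.0)


def standard_q_update(theta, phi, reward, next_phis, alpha, gamma):
    """Standard semi-gradient step: theta + alpha * delta * phi."""
    delta = td_target(theta, reward, next_phis, gamma) - dot(phi, theta)
    return [w + alpha * delta * x for w, x in zip(theta, phi)]


def implicit_q_update(theta, phi, reward, next_phis, alpha, gamma):
    """Assumed implicit step: closed-form solution of
    theta_new = theta + alpha * (target - phi^T theta_new) * phi,
    i.e. a standard step with alpha replaced by alpha / (1 + alpha * ||phi||^2).
    """
    delta = td_target(theta, reward, next_phis, gamma) - dot(phi, theta)
    effective_alpha = alpha / (1.0 + alpha * dot(phi, phi))
    return [w + effective_alpha * delta * x for w, x in zip(theta, phi)]


def prediction_error_after(update, steps, alpha):
    """Replay one hypothetical terminal transition and return |Q - reward|."""
    phi, reward, gamma = [3.0, 4.0], 1.0, 0.9  # ||phi||^2 = 25
    theta = [0.0, 0.0]
    for _ in range(steps):
        theta = update(theta, phi, reward, [], alpha, gamma)
    return abs(dot(phi, theta) - reward)


if __name__ == "__main__":
    # With alpha = 1 and ||phi||^2 = 25, the standard step overshoots on
    # every iteration, while the implicit step contracts the error.
    print("standard:", prediction_error_after(standard_q_update, 20, alpha=1.0))
    print("implicit:", prediction_error_after(implicit_q_update, 20, alpha=1.0))
```

On this toy transition the standard update multiplies the prediction error by (1 - alpha * ||phi||^2) each step, so it diverges once alpha * ||phi||^2 exceeds 2. The implicit update multiplies the error by 1 / (1 + alpha * ||phi||^2), which is below 1 for any positive alpha. A SARSA variant would differ only in the target, using the value of the next action actually taken instead of the maximum.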