The asymptotic behaviour of Monte Carlo Exploring Starts (MCES) is a long-standing open question in reinforcement learning, even in the tabular setting. We investigate the convergence properties of tabular MCES by constructing examples in which the algorithm converges to suboptimal solutions. This paper presents new counterexamples for both initial-visit and first-visit MCES and gives a convergence-restoring modification for the initial-visit case. We show that stable suboptimal solutions can exist for initial-visit MCES with sample-average updates, even when greedy actions are updated more often on average than non-greedy actions. However, scaling the learning rates inversely to the update frequencies on a state-by-state basis guarantees convergence to optimality. Unlike previous uniformisation methods, this modification is applicable to large-scale problems that require approximating the estimated value function. We then extend the example to show that sample-average first-visit MCES may also converge to suboptimal solutions. This largely settles a fundamental open problem and shows that exploring starts alone do not guarantee convergence to optimality. More broadly, these results highlight that convergence depends critically on the relative size and frequency of the updates applied to different actions, which makes the choice of learning rates and the balance between exploration and exploitation central both to the analysis of MCES and to the implementation of scalable Monte Carlo control methods.
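To make the distinction between the two variants concrete, the following is a minimal Python sketch of a sample-average Monte Carlo update applied to a single episode. It is an illustration under stated assumptions, not the paper's implementation: the function name `mc_update`, the `mode` flag, and the episode format (a list of `(state, action, reward)` tuples) are hypothetical, and "initial-visit" is read here as updating only the episode's starting state-action pair. The convergence-restoring learning-rate scaling described above is not reproduced, since the abstract does not give its exact form.

```python
from collections import defaultdict


def mc_update(Q, N, episode, gamma=1.0, mode="first"):
    """Apply one sample-average Monte Carlo update to Q from a single episode.

    episode: list of (state, action, reward) tuples, where reward is the
             reward received after taking `action` in `state`.
    mode:    "first"   -> update every pair at its first occurrence;
             "initial" -> update only the starting pair (assumed reading
                          of "initial-visit").
    """
    # Compute the discounted return from every time step, working backwards.
    returns = [0.0] * len(episode)
    G = 0.0
    for t in range(len(episode) - 1, -1, -1):
        G = episode[t][2] + gamma * G
        returns[t] = G

    seen = set()
    for t, (s, a, _) in enumerate(episode):
        if mode == "initial" and t > 0:
            break  # only the exploring-start pair is updated
        if (s, a) in seen:
            continue  # skip repeat visits within the episode
        seen.add((s, a))
        N[(s, a)] += 1
        # Sample-average step size 1 / N(s, a): Q becomes the running mean
        # of the returns observed for this pair.
        Q[(s, a)] += (returns[t] - Q[(s, a)]) / N[(s, a)]
    return Q


# Hypothetical three-step episode that revisits ("s0", "a").
episode = [("s0", "a", 1.0), ("s1", "b", 2.0), ("s0", "a", 3.0)]
Q_first, N_first = defaultdict(float), defaultdict(int)
Q_init, N_init = defaultdict(float), defaultdict(int)
mc_update(Q_first, N_first, episode, mode="first")
mc_update(Q_init, N_init, episode, mode="initial")
```

In this sketch the two modes differ only in how many pairs receive an update per episode, so the pairs are updated at different frequencies, which is the quantity the results above identify as critical for convergence.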