Tsitsiklis proved convergence of Monte Carlo optimistic policy iteration under a uniform update structure and identified nonuniform update frequencies as a delicate obstruction. We give a certified negative answer for the natural scalar-stepsize, unnormalized asynchronous state-value recursion with fixed nonuniform state-selection probabilities. In a three-state, two-action discounted MDP, the nonuniform update frequencies induce a diagonally scaled greedy-policy mean field with a certified nonconstant attracting hybrid periodic orbit. With a bounded unbiased geometric-horizon estimator and Robbins--Monro stepsizes, the original stochastic recursion remains trapped near the cycle with positive probability and therefore fails to converge. The example pinpoints a geometric obstruction: uniform sampling gives radial residual contraction, whereas scalar nonuniform sampling anisotropically distorts the residual dynamics and can generate switched attracting cycles.
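To make the recursion described in the abstract concrete, the following is a minimal sketch of one plausible reading of it, not the certified construction. The assumptions are: the transition kernel `P`, rewards `R`, selection probabilities `select_probs`, and discount `gamma` are hypothetical placeholders (the abstract does not give the counterexample's numbers, so this MDP is not expected to exhibit the cycle); the "bounded unbiased geometric-horizon estimator" is taken to be the reward observed at an independent geometric time, rescaled by 1/(1 - gamma); and "scalar, unnormalized" is taken to mean a single global stepsize 1/(t + 1) that is not divided by the selection probability or by per-state visit counts. Under this reading, the mean field would be the diagonally scaled system dV/dτ = D (J^{μ(V)} - V) with D = diag(`select_probs`), where J^{μ(V)} is the value of the policy greedy with respect to V.

```python
import numpy as np

def greedy_policy(V, P, R, gamma):
    """Greedy policy w.r.t. V. P has shape (A, S, S), R has shape (S, A)."""
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    return Q.argmax(axis=1)

def geometric_horizon_return(s, policy, P, R, gamma, rng):
    """Bounded unbiased estimate of the discounted return of `policy` from s.

    Assumed form: draw N with P(N = n) = (1 - gamma) * gamma**n, follow the
    policy for N transitions, and return the reward there divided by
    (1 - gamma). Its magnitude is at most max|R| / (1 - gamma).
    """
    n = rng.geometric(1.0 - gamma) - 1  # NumPy's geometric starts at 1
    for _ in range(n):
        s = rng.choice(P.shape[1], p=P[policy[s], s])
    return R[s, policy[s]] / (1.0 - gamma)

def run(num_steps=2000, seed=0):
    rng = np.random.default_rng(seed)
    S, A, gamma = 3, 2, 0.9
    # Hypothetical placeholder MDP, NOT the paper's certified example.
    P = rng.dirichlet(np.ones(S), size=(A, S))   # shape (A, S, S)
    R = rng.uniform(-1.0, 1.0, size=(S, A))
    select_probs = np.array([0.7, 0.2, 0.1])     # fixed nonuniform selection
    V = np.zeros(S)
    for t in range(num_steps):
        step = 1.0 / (t + 1)                     # scalar Robbins-Monro stepsize
        s = rng.choice(S, p=select_probs)        # asynchronous: one state per step
        policy = greedy_policy(V, P, R, gamma)   # optimistic: greedy w.r.t. current V
        G = geometric_horizon_return(s, policy, P, R, gamma, rng)
        V[s] += step * (G - V[s])                # unnormalized update of selected state
    return V, R, gamma

if __name__ == "__main__":
    V, _, _ = run()
    print(V)
```

Because each update is a convex combination of the previous value and a bounded estimate, the iterates stay within max|R| / (1 - gamma) in magnitude. The nonconvergence claimed in the abstract therefore concerns persistent oscillation near a cycle, not divergence.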