Poisson equations underpin average-reward reinforcement learning, but beyond ergodicity they can be ill-posed: solutions are non-unique, and standard fixed-point iterations can oscillate on reducible or periodic chains. We study finite-state Markov chains with $n$ states and transition matrix $P$. We show that all non-decaying modes are captured by a real peripheral invariant subspace $\mathcal{K}(P)$, and that the induced operator on the quotient space $\mathbb{R}^n/\mathcal{K}(P)$ is strictly contractive, yielding a unique quotient solution. Building on this viewpoint, we develop an end-to-end pipeline that learns the chain structure, estimates an anchor-based gauge map, and runs projected stochastic approximation to estimate a gauge-fixed representative together with an associated peripheral residual. We prove $\widetilde{O}(T^{-1/2})$ convergence up to projection estimation error, enabling stable Poisson-equation learning in multichain and periodic regimes, with applications to performance evaluation for average-reward reinforcement learning beyond ergodicity.
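The phenomenon the abstract describes can be seen numerically. The following is a minimal sketch, not the paper's pipeline: it uses a toy 3-state chain of period 2 (the matrix $P$, reward $r$, and the orthogonal-projector construction are our own illustrative choices). The peripheral eigenvalues of this $P$ are $\pm 1$, so $\mathcal{K}(P)$ is two-dimensional; the remaining eigenvalue is $0$, so the induced map on the quotient is strictly contractive. The raw fixed-point iteration $h \leftarrow r - \rho\mathbf{1} + Ph$ cycles with period 2, while the same iteration projected off $\mathcal{K}(P)$ converges to a fixed quotient representative.

```python
import numpy as np

# Toy 3-state periodic chain (period 2): states {0, 2} alternate with state 1.
P = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])
r = np.array([1.0, 0.0, 0.0])

# Stationary distribution and average reward (well-defined despite periodicity).
w, V = np.linalg.eig(P.T)
pi = np.real(V[:, np.argmin(np.abs(w - 1.0))])
pi /= pi.sum()
rho = pi @ r                                        # approx. 0.25 for this chain

# Peripheral invariant subspace K(P): eigenvectors with |lambda| = 1.
w, V = np.linalg.eig(P)
K = np.real(V[:, np.abs(np.abs(w) - 1.0) < 1e-9])   # here span{(1,1,1), (1,-1,1)}
Q = np.eye(3) - K @ np.linalg.pinv(K)               # orthogonal projector off K(P)

# Raw iteration h <- r - rho*1 + P h picks up the non-decaying eigenvalue -1
# and cycles; the projected iteration contracts (only eigenvalue 0 survives).
h_raw = np.zeros(3)
h_proj = np.zeros(3)
for _ in range(50):
    h_raw = r - rho + P @ h_raw
    h_proj = Q @ (r - rho + P @ h_proj)

print(np.round(h_raw, 3))    # still oscillating: the K(P) component never decays
print(np.round(h_proj, 3))   # converged gauge-fixed representative
```

Because $\mathcal{K}(P)$ is $P$-invariant, the projected operator inherits exactly the non-peripheral spectrum, which is the quotient-contraction claim in miniature; the paper's pipeline estimates this projection from data rather than from the known $P$.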