Diagnosing Non-Markovian Observations in Reinforcement Learning via Prediction-Based Violation Scoring

Reinforcement learning algorithms assume that observations satisfy the Markov property, yet real-world sensors frequently violate this assumption through correlated noise, latency, or partial observability. Standard performance metrics conflate Markov breakdowns with other sources of suboptimality, leaving practitioners without diagnostic tools for such violations. This paper introduces a prediction-based scoring method that quantifies non-Markovian structure in observation trajectories. A random forest first removes nonlinear Markov-compliant dynamics; ridge regression then tests whether historical observations reduce prediction error on the residuals beyond what the current observation provides. The resulting score is bounded in [0, 1] and requires no causal graph construction. Evaluation spans six environments (CartPole, Pendulum, Acrobot, HalfCheetah, Hopper, Walker2d), three algorithms (PPO, A2C, SAC), controlled AR(1) noise at six intensity levels, and 10 seeds per condition. In post-hoc detection, 7 of 16 environment-algorithm pairs, primarily high-dimensional locomotion tasks, show significant positive monotonicity between noise intensity and the violation score (Spearman rho up to 0.78, confirmed under repeated-measures analysis); under training-time noise, 13 of 16 pairs exhibit statistically significant reward degradation. An inversion phenomenon is documented in low-dimensional environments where the random forest absorbs the noise signal, causing the score to decrease as true violations grow, a failure mode analyzed in detail. A practical utility experiment demonstrates that the proposed score correctly identifies partial observability and guides architecture selection, fully recovering performance lost to non-Markovian observations. Source code to reproduce all results is provided at https://github.com/NAVEENMN/Markovianes.
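The two-stage procedure summarized above can be illustrated with a minimal sketch. It is not the implementation from the linked repository, and the abstract does not give the exact score formula. The sketch assumes the score is the fractional reduction in held-out mean squared error when lagged observations are added to the ridge model, clipped to [0, 1]. The function names (`ar1_noise`, `simulate_clean`, `violation_score`), the history length of 3, the out-of-fold evaluation, the hyperparameters, and the toy dynamics are all illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict


def ar1_noise(n_steps, n_dims, phi, sigma, rng):
    """AR(1) noise: e[t] = phi * e[t-1] + sigma * w[t], with w[t] ~ N(0, I)."""
    noise = np.zeros((n_steps, n_dims))
    for t in range(1, n_steps):
        noise[t] = phi * noise[t - 1] + sigma * rng.standard_normal(n_dims)
    return noise


def simulate_clean(n_steps, n_dims, rng):
    """Toy nonlinear Markov process (an assumption, not one of the paper's environments)."""
    x = np.zeros((n_steps, n_dims))
    for t in range(1, n_steps):
        x[t] = np.cos(2.0 * x[t - 1]) + 0.5 * rng.standard_normal(n_dims)
    return x


def violation_score(obs, history_len=3, ridge_alpha=1.0, seed=0):
    """Illustrative score in [0, 1] for one trajectory `obs` of shape (T, d)."""
    k = history_len
    current = obs[k:-1]    # o[t]
    target = obs[k + 1:]   # o[t+1]
    # Lagged observations o[t-1], ..., o[t-k], stacked column-wise.
    history = np.hstack([obs[k - j:-1 - j] for j in range(1, k + 1)])

    # Stage 1: a random forest predicts o[t+1] from o[t] alone; out-of-fold
    # predictions keep the residuals from being artificially small.
    forest = RandomForestRegressor(
        n_estimators=50, min_samples_leaf=20, random_state=seed
    )
    forest_pred = cross_val_predict(forest, current, target, cv=5)
    residual = target - forest_pred.reshape(target.shape)

    # Stage 2: ridge regression on the residuals, with and without history.
    def held_out_mse(features):
        pred = cross_val_predict(Ridge(alpha=ridge_alpha), features, residual, cv=5)
        return float(np.mean((residual - pred.reshape(residual.shape)) ** 2))

    mse_current = held_out_mse(current)
    mse_with_history = held_out_mse(np.hstack([current, history]))
    if mse_current <= 0.0:
        return 0.0
    # Assumed score: fraction of residual error removed by adding history.
    return float(np.clip(1.0 - mse_with_history / mse_current, 0.0, 1.0))


rng = np.random.default_rng(0)
clean = simulate_clean(3000, 2, rng)
noisy = clean + ar1_noise(3000, 2, phi=0.95, sigma=0.5, rng=rng)
clean_score = violation_score(clean)
noisy_score = violation_score(noisy)
print(f"clean: {clean_score:.3f}  noisy: {noisy_score:.3f}")
```

In this toy setup the AR(1)-corrupted trajectory would be expected to score higher than the clean one, because lagged observations help separate the slowly varying noise from the state. The inversion phenomenon reported for low-dimensional environments shows that this ordering is not guaranteed in general.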
