Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory typically represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally -- even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
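To make the length/curvature/near-tie signature concrete, the following is a minimal sketch of how such replay-based proxies could be computed from successive Q-value snapshots and used to scale learning intensity. All function names, the snapshot interface, and the linear scaling rule in `adapted_lr` are illustrative assumptions, not the paper's actual estimators.

```python
import numpy as np

def path_signature(q_snapshots):
    """Discrete proxies for cumulative drift (path length) and
    acceleration/oscillation (curvature), from Q-table snapshots.

    q_snapshots: array of shape (T, S, A) -- Q(s, a) at T checkpoints.
    Hypothetical interface; the paper's exact estimators may differ.
    """
    q = np.asarray(q_snapshots, dtype=float)
    steps = np.diff(q, axis=0)            # first differences ~ velocity
    length = np.linalg.norm(steps.reshape(len(steps), -1), axis=1).sum()
    accel = np.diff(q, n=2, axis=0)       # second differences ~ acceleration
    curvature = np.linalg.norm(accel.reshape(len(accel), -1), axis=1).sum()
    return length, curvature

def near_tie_proximity(q, eps=1e-8):
    """Inverse action gap: large when the top two actions are nearly tied,
    i.e. near the switch regimes where the optimal action changes."""
    top2 = np.sort(np.asarray(q, dtype=float), axis=-1)[..., -2:]
    gap = top2[..., 1] - top2[..., 0]
    return float(1.0 / (gap.min() + eps))

def adapted_lr(base_lr, length, curvature, tie, c=(0.1, 0.1, 0.01)):
    """One simple (assumed) choice: raise learning intensity linearly
    with the estimated drift, curvature, and near-tie proxies."""
    return base_lr * (1.0 + c[0] * length + c[1] * curvature + c[2] * tie)
```

For example, a wrapper in the spirit of HT-RL would recompute `path_signature` and `near_tie_proximity` periodically over recent replay snapshots and feed the result to `adapted_lr`, increasing the update rate precisely in the oscillatory and switch-prone regimes where static step sizes lag.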