Non-Markovian dynamics are common in real-world environments owing to long-range dependencies, partial observability, and memory effects. The Bellman equation, the central pillar of reinforcement learning (RL), holds only approximately under non-Markovian dynamics. Existing work largely focuses on practical algorithm design and offers limited theoretical treatment of key questions, such as which dynamics the Bellman framework can actually capture and how optimal approximations can inspire new classes of algorithms. In this paper, we present a novel topological viewpoint on temporal-difference (TD) based RL. We show that TD errors can be viewed as 1-cochains on the topological space of state transitions, and that Markov dynamics are then interpreted as topological integrability. This view yields a Hodge-type decomposition of TD errors into an integrable component and a topological residual via a Bellman-de Rham projection. Building on this decomposition, we propose HodgeFlow Policy Search (HFPS), which fits a potential network to minimize the non-integrable projection residual and achieves stability and sensitivity guarantees. In numerical evaluations, HFPS significantly improves RL performance under non-Markovian dynamics.
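To fix notation for what such a decomposition might look like, here is a minimal sketch; the symbols \delta, \Phi, \rho, the undiscounted coboundary d, and the least-squares projection are our illustrative choices, not necessarily the paper's. View each observed transition (s, s') as an edge and the TD error \delta(s, s') as a 1-cochain on these edges. A Hodge-type split then reads

\[
\delta = d\Phi + \rho,
\qquad (d\Phi)(s, s') = \Phi(s') - \Phi(s),
\qquad \Phi^\star \in \arg\min_{\Phi} \sum_{(s,s')} \big( \delta(s,s') - (d\Phi)(s,s') \big)^2,
\]

where \Phi is a potential (0-cochain) on states and \rho = \delta - d\Phi^\star is the projection residual. The cochain \delta is integrable (\rho = 0) exactly when it sums to zero around every cycle of transitions; a nonzero \rho is the topological obstruction that the abstract calls the residual.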
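Algorithmically, the residual-minimization step can be illustrated as regressing observed TD errors onto a coboundary parameterized by a potential network. The sketch below (PyTorch; the class name PotentialNet, its architecture, the batch format, and the training loop are all our assumptions, not the authors' implementation) fits a potential by gradient descent and reports the remaining non-integrable residual:

    import torch
    import torch.nn as nn

    class PotentialNet(nn.Module):
        """0-cochain: maps a state to a scalar potential Phi(s). (Hypothetical architecture.)"""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, s):
            return self.net(s).squeeze(-1)

    def fit_potential(phi, s, s_next, delta, steps=500, lr=1e-3):
        """Least-squares projection of the TD-error 1-cochain delta onto exact
        cochains d Phi, with (d Phi)(s, s') = Phi(s') - Phi(s).
        Returns the norm of the non-integrable residual rho = delta - d Phi."""
        opt = torch.optim.Adam(phi.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            d_phi = phi(s_next) - phi(s)          # coboundary on sampled edges
            loss = ((delta - d_phi) ** 2).mean()  # projection residual
            loss.backward()
            opt.step()
        with torch.no_grad():
            residual = delta - (phi(s_next) - phi(s))
        return residual.pow(2).mean().sqrt()

    # Usage on random data, purely to show shapes:
    if __name__ == "__main__":
        torch.manual_seed(0)
        s, s_next = torch.randn(256, 4), torch.randn(256, 4)
        delta = torch.randn(256)  # stand-in for observed TD errors
        print(fit_potential(PotentialNet(4), s, s_next, delta).item())

Under this reading, the fitted residual could serve as a diagnostic for how far the environment departs from Markovian (integrable) dynamics, consistent with the stability/sensitivity framing in the abstract.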