Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches exploit structural priors on the MDP by constructing state representations as linear combinations of the eigenvectors of the state-graph Laplacian. When the transition graph is unknown or the state space is prohibitively large, these spectral features can be estimated directly from sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, which yields an end-to-end error decomposition across the representation learning pipeline. Additionally, our formulation of the Laplacian operator for the RL setting, although equivalent to existing ones, avoids some common misunderstandings, which we illustrate with examples from the literature. Our results hold for general (non-uniform) policies, without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.
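To make the pipeline described in the abstract concrete, the following is a minimal sketch, not the construction analyzed in this work. It assumes a small obstacle-free gridworld, a uniform random-walk policy (so the induced kernel is symmetric up to degree normalization, a special case that the results above do not require), the combinatorial Laplacian L = D - W computed from the known graph rather than estimated from trajectories, and a single rewarding state. All function names (`grid_adjacency`, `spectral_features`, `value_function`, `projection_error`) are hypothetical and chosen for illustration only.

```python
import numpy as np


def grid_adjacency(n):
    """Adjacency matrix of an n x n 4-connected gridworld without walls."""
    W = np.zeros((n * n, n * n))
    for r in range(n):
        for c in range(n):
            s = r * n + c
            if r + 1 < n:  # edge to the cell below
                W[s, s + n] = W[s + n, s] = 1.0
            if c + 1 < n:  # edge to the cell on the right
                W[s, s + 1] = W[s + 1, s] = 1.0
    return W


def spectral_features(W, k):
    """Eigenvalues (ascending) and the first k eigenvectors of L = D - W."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies
    return eigvals, eigvecs[:, :k]


def value_function(W, reward, gamma):
    """Exact value function of the uniform random walk: V = (I - gamma P)^-1 r."""
    P = W / W.sum(axis=1, keepdims=True)  # assumption: uniform policy
    return np.linalg.solve(np.eye(W.shape[0]) - gamma * P, reward)


def projection_error(Phi, v):
    """Euclidean error of the best linear fit of v in the span of Phi."""
    coef, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    return np.linalg.norm(v - Phi @ coef)


n = 5
W = grid_adjacency(n)
reward = np.zeros(n * n)
reward[-1] = 1.0  # assumption: reward only in the bottom-right cell
v = value_function(W, reward, gamma=0.9)

eigvals, basis = spectral_features(W, n * n)
algebraic_connectivity = eigvals[1]  # second-smallest Laplacian eigenvalue

# Approximation error as a function of the number of spectral features.
errors = [projection_error(basis[:, :k], v) for k in range(1, n * n + 1)]
print("algebraic connectivity:", algebraic_connectivity)
print("error with 1, 5, 25 features:", errors[0], errors[4], errors[-1])
```

Because the feature subspaces are nested, the error in this sketch cannot increase as features are added, and it vanishes once all eigenvectors are used. How quickly it decays for a given graph is the kind of question the bound in terms of algebraic connectivity addresses.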