State aggregation aims to reduce the computational complexity of solving Markov Decision Processes (MDPs) while preserving the performance of the original system. A fundamental challenge lies in optimizing policies within the aggregated, or abstract, space such that the performance remains optimal in the ground MDP, a property referred to as "optimal policy equivalence". This paper presents an abstraction framework based on the notion of homomorphism, in which two Markov chains are deemed homomorphic if their value functions exhibit a linear relationship. Within this theoretical framework, we establish a sufficient condition for optimal policy equivalence. We further examine scenarios where this sufficient condition is not met and derive an upper bound on the approximation error together with a lower bound on the performance of the objective function under the ground MDP. We propose Homomorphic Policy Gradient (HPG), which guarantees optimal policy equivalence under the sufficient condition, and its extension, Error-Bounded HPG (EBHPG), which balances computational efficiency against the performance loss induced by aggregation. In experiments, we validate the theoretical results and conduct comparative evaluations against seven algorithms.
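The linear value-function relationship that defines the homomorphism can be sketched as follows; the aggregation map $\phi$ and the coefficients $a$ and $b$ below are illustrative notation introduced here, not the paper's own symbols. A ground Markov chain $M$ and an abstract chain $M'$ are homomorphic in this sense if there exist a state mapping $\phi$ from ground states to abstract states and scalars $a > 0$, $b$ such that
\[
  V_{M}(s) \;=\; a \, V_{M'}\bigl(\phi(s)\bigr) + b \qquad \text{for every ground state } s,
\]
so that ordering (and hence optimality) of policies is preserved across the two chains up to this affine transformation.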