We study the geometry of feasible value functions in infinite-horizon partially observable Markov decision processes (POMDPs) under memoryless stochastic policies. Our main contribution is a characterization of the feasible set of value functions as a semi-algebraic set, defined by explicit polynomial inequalities determined by the transition dynamics, observation kernel, and reward structure of the POMDP. This result extends prior work for fully observable Markov decision processes, where the feasible set is known to be a polytope, to the substantially more intricate partially observable setting. In contrast to the polyhedral structure arising in MDPs, partial observability induces fundamentally nonlinear constraints, leading to a richer and more complex geometric structure. Our geometric characterization provides new insight into the landscape of policy optimization in both MDPs and POMDPs, and reveals qualitative phenomena unique to partial observability, including the emergence of isolated local maximizers of the long-term reward and their dependence on the initial state distribution.
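As a rough illustration of the objects in question, the sketch below evaluates the value function of one memoryless stochastic policy in a small tabular POMDP. It is not the paper's construction. It assumes a discounted criterion with factor `gamma` (the abstract only says "infinite-horizon"), and the names `value_function`, `P`, `beta`, `r`, and `pi` are hypothetical. The policy enters only through the effective state-action kernel `beta @ pi`, and the value function is the solution of a linear system whose matrix depends on that kernel.

```python
import numpy as np


def value_function(P, beta, r, pi, gamma):
    """Discounted value function of a memoryless stochastic policy (sketch).

    P:     (S, A, S) array, P[s, a, t] = probability of moving from s to t under a
    beta:  (S, O) array, beta[s, o] = probability of observing o in state s
    r:     (S, A) array of instantaneous rewards
    pi:    (O, A) array, pi[o, a] = probability of action a given observation o
    gamma: discount factor in [0, 1) (an assumption; not stated in the abstract)
    """
    # Effective state-action policy: tau[s, a] = sum_o beta[s, o] * pi[o, a].
    tau = beta @ pi
    # State-to-state transition matrix and expected one-step reward under tau.
    P_pi = np.einsum("sa,sat->st", tau, P)
    r_pi = np.sum(tau * r, axis=1)
    # Solve the Bellman equation (I - gamma * P_pi) V = r_pi.
    n_states = P.shape[0]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Toy instance: two states that emit the same single observation,
# so the policy cannot tell them apart.
P_toy = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.1, 0.9]]])
beta_toy = np.array([[1.0], [1.0]])
r_toy = np.array([[1.0, 0.0], [0.0, 2.0]])
pi_toy = np.array([[0.3, 0.7]])
V_toy = value_function(P_toy, beta_toy, r_toy, pi_toy, gamma=0.9)
print(V_toy)
```

Sweeping `pi_toy` over the probability simplex and collecting the resulting vectors traces out the feasible set of value functions for this toy instance under the stated assumptions.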