Toward Learning POMDPs Beyond Full-Rank Actions and State Observability

We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP's action and observation spaces, but not its state space, transition model, or observation model. These must be constructed from a sequence of actions and observations. Spectral approaches to learning models of partially observable domains, such as Predictive State Representations (PSRs), learn representations of state that are sufficient to predict future outcomes. PSR models, however, lack the explicit transition and observation models that could be combined with different reward functions to solve different planning problems. Under a mild set of rank assumptions on the products of transition and observation matrices, we show that PSRs learn POMDP matrices up to a similarity transform, and that this transform can be estimated via tensor decomposition methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states within a single block of the partition have the same observation distributions for every action whose transition matrix is full-rank. Our experiments suggest that explicit observation and transition likelihoods can be leveraged to generate new plans for different goals and reward functions after the model has been learned. We also show that learning a POMDP beyond a partition of states is impossible from sequential data, by constructing two POMDPs that agree on all observation distributions but differ in their transition dynamics.
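To make the phrase "up to a similarity transform" concrete, the following minimal sketch builds a small random POMDP and checks that conjugating its per-(action, observation) operators by an arbitrary invertible matrix leaves every observation-sequence probability unchanged. The conventions here are assumptions made for illustration, not taken from the paper: `T[a]` is row-stochastic, `O[a][s, o]` is the probability of observation `o` after arriving in state `s` under action `a`, the operator is `T[a] @ diag(O[a][:, o])`, and the transform `S` is a random matrix rather than one recovered by any learning procedure.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2  # illustrative sizes (assumption)


def random_stochastic(rows, cols):
    """Random matrix whose rows are probability distributions."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)


# Hypothetical POMDP parameters: one transition and one observation matrix per action.
T = [random_stochastic(n_states, n_states) for _ in range(n_actions)]
O = [random_stochastic(n_states, n_obs) for _ in range(n_actions)]
b0 = random_stochastic(1, n_states)[0]  # initial belief (row vector)
ones = np.ones(n_states)

# Operator for (a, o): transition under a, then weight by the likelihood of o.
A = {
    (a, o): T[a] @ np.diag(O[a][:, o])
    for a in range(n_actions)
    for o in range(n_obs)
}

# Arbitrary invertible transform; strict diagonal dominance guarantees invertibility.
S = rng.random((n_states, n_states)) + n_states * np.eye(n_states)
S_inv = np.linalg.inv(S)

# Transformed model: generally no longer made of stochastic matrices.
B = {key: S_inv @ M @ S for key, M in A.items()}
b0_t = b0 @ S
ones_t = S_inv @ ones


def seq_prob(ops, start, end, actions, observations):
    """Probability of an observation sequence given an action sequence."""
    v = start
    for a, o in zip(actions, observations):
        v = v @ ops[(a, o)]
    return float(v @ end)


actions = (0, 1, 1)
all_obs = list(itertools.product(range(n_obs), repeat=len(actions)))
probs_A = [seq_prob(A, b0, ones, actions, obs) for obs in all_obs]
probs_B = [seq_prob(B, b0_t, ones_t, actions, obs) for obs in all_obs]

print(np.allclose(probs_A, probs_B))
```

Because the two operator sets predict identical sequence probabilities, action-observation data alone cannot distinguish them. Recovering stochastic transition and observation matrices therefore requires estimating the transform separately, which the abstract attributes to tensor decomposition.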
