We are interested in enabling autonomous agents to learn and reason about systems with hidden states, such as furniture with hidden locking mechanisms. We cast this problem as learning the parameters of a discrete Partially Observable Markov Decision Process (POMDP). The agent begins with knowledge of the POMDP's action and observation spaces, but not its state space, transition model, or observation model. These properties must be constructed from action-observation sequences. Spectral approaches to learning models of partially observable domains, such as learning Predictive State Representations (PSRs), are known to directly estimate the number of hidden states. These methods cannot, however, yield direct estimates of transition and observation likelihoods, which are important for many downstream reasoning tasks. Other approaches leverage tensor decompositions to estimate transition and observation likelihoods, but often assume full state observability and full-rank transition matrices for all actions. To relax these assumptions, we study how PSRs learn transition and observation matrices up to a similarity transform, which may be estimated via tensor methods. Our method learns observation matrices and transition matrices up to a partition of states, where the states in a single partition have the same observation distributions for those actions whose transition matrices are full-rank. Our experiments suggest that, given a sufficient amount of data, the partition-level transition models learned by our method match the performance of PSRs when used as models by standard sampling-based POMDP solvers. Furthermore, the explicit observation and transition likelihoods can be leveraged to specify planner behavior after the model has been learned.
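The two spectral facts mentioned above can be illustrated on a toy problem: the number of hidden states appears as the rank of a matrix of history-test probabilities, and observable operators are identifiable only up to a similarity transform. The sketch below is not the authors' method. It assumes a small, randomly generated POMDP (3 states, 2 actions, 2 observations), a uniform initial belief, and exact sequence probabilities in place of estimates from sampled action-observation data. All names (`op`, `chain`, `H`, `S`) are hypothetical.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_obs = 3, 2, 2  # toy sizes, chosen only for illustration

def random_stochastic(rows, cols):
    """Random matrix whose rows each sum to 1."""
    m = rng.random((rows, cols))
    return m / m.sum(axis=1, keepdims=True)

# Assumed toy model: T[a][s, s'] = P(s' | s, a), O[a][s', o] = P(o | s', a).
T = [random_stochastic(n_states, n_states) for _ in range(n_actions)]
O = [random_stochastic(n_states, n_obs) for _ in range(n_actions)]
b0 = np.full(n_states, 1.0 / n_states)  # assumed uniform initial belief
ones = np.ones(n_states)

def op(a, o):
    """Observable operator: entry [s, s'] is P(s', o | s, a)."""
    return T[a] @ np.diag(O[a][:, o])

def chain(steps):
    """Product of operators along a sequence of (action, observation) pairs."""
    m = np.eye(n_states)
    for a, o in steps:
        m = m @ op(a, o)
    return m

# Histories and tests: all (action, observation) sequences of length 1 and 2.
pairs = list(itertools.product(range(n_actions), range(n_obs)))
seqs = [(p,) for p in pairs] + list(itertools.product(pairs, repeat=2))

# H[i, j] = probability of seeing history i's observations followed by test j's
# observations, given the corresponding actions. H factors through the
# n_states-dimensional belief, so its rank is at most n_states.
H = np.array([[b0 @ chain(h) @ chain(t) @ ones for t in seqs] for h in seqs])
sv = np.linalg.svd(H, compute_uv=False)
estimated_states = int(np.sum(sv > 1e-8 * sv[0]))

# Similarity transform: replacing each operator M by S^{-1} M S (and adjusting
# the initial and final vectors) leaves every sequence probability unchanged.
S = rng.random((n_states, n_states)) + n_states * np.eye(n_states)  # invertible
S_inv = np.linalg.inv(S)
seq = [(0, 1), (1, 0), (0, 0)]
p_true = b0 @ chain(seq) @ ones
v = b0 @ S
for a, o in seq:
    v = v @ (S_inv @ op(a, o) @ S)
p_transformed = v @ (S_inv @ ones)

print("estimated number of hidden states:", estimated_states)
print("sequence probability (original, transformed):", p_true, p_transformed)
```

Because the transformed operators reproduce every sequence probability, action-observation data alone cannot distinguish them from the true ones. This is why a further step, such as the tensor methods mentioned above, is needed to recover explicit transition and observation likelihoods. With sampled data the small singular values of `H` would be noisy rather than exactly zero, so the rank threshold used here would need to be chosen with care.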