Strategies for partially observable Markov decision processes (POMDPs) typically require memory. One way to represent this memory is via automata. We present a method to learn an automaton representation of a strategy using a modification of the L* algorithm. Compared to the tabular representation of a strategy, the resulting automaton is dramatically smaller and thus also more explainable. Moreover, the heuristics applied during learning may even improve the strategy's performance. In contrast to approaches that synthesize an automaton directly from the POMDP, thereby solving it, our approach is incomparably more scalable.
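To make the contrast between a tabular and an automaton representation concrete, the following minimal sketch encodes a toy strategy as a Mealy machine whose states act as memory. It is an illustration only and not the learning procedure described above: the class `MealyStrategy`, the observations `clear` and `wall`, and the actions `left` and `right` are all hypothetical names chosen for this example, and no L*-style learning is performed.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class MealyStrategy:
    """Hypothetical finite-memory strategy: observations in, actions out."""
    initial: str
    # (memory state, observation) -> (action, next memory state)
    transitions: dict

    def run(self, observations):
        """Return the action chosen after each observation in the sequence."""
        state = self.initial
        actions = []
        for obs in observations:
            action, state = self.transitions[(state, obs)]
            actions.append(action)
        return actions


# Toy strategy (an assumption for illustration): keep moving in one
# direction and reverse whenever a wall is observed.
strategy = MealyStrategy(
    initial="going_right",
    transitions={
        ("going_right", "clear"): ("right", "going_right"),
        ("going_right", "wall"): ("left", "going_left"),
        ("going_left", "clear"): ("left", "going_left"),
        ("going_left", "wall"): ("right", "going_right"),
    },
)

# Equivalent tabular representation: one entry per observation history
# of length 1 to 3, mapping the history to the action taken after it.
OBSERVATIONS = ("clear", "wall")
table = {
    history: strategy.run(history)[-1]
    for length in range(1, 4)
    for history in product(OBSERVATIONS, repeat=length)
}

print(strategy.run(["clear", "clear", "wall", "clear"]))
print(len(strategy.transitions), "transitions vs", len(table), "table entries")
```

In this toy setting the table grows with the number of histories (2 + 4 + 8 entries for horizon 3), while the automaton keeps a fixed four transitions regardless of the horizon. This is the kind of size gap that motivates an automaton representation.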