Strategies for partially observable Markov decision processes (POMDPs) typically require memory. One way to represent this memory is via automata. We present a method that learns an automaton representation of a strategy using a modification of the L* algorithm. Compared to the tabular representation of a strategy, the resulting automaton is dramatically smaller and thus also more explainable. Moreover, the heuristics applied during learning may even improve the strategy's performance. In contrast to approaches that synthesize an automaton directly from the POMDP, thereby solving it, our approach is far more scalable.
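To make the contrast between a tabular strategy and an automaton concrete, the following minimal sketch encodes one toy strategy in both forms. It does not implement the L*-based learning method. The observations (`wall`, `door`), the actions (`left`, `right`), the class `MealyStrategy`, and the rule "play `right` once `door` has been observed" are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical observation alphabet of a toy POMDP (an assumption for illustration).
OBSERVATIONS = ("wall", "door")


class MealyStrategy:
    """Finite-memory strategy given as a Mealy machine over observations."""

    def __init__(self, initial, transitions):
        # transitions maps (state, observation) -> (next_state, action)
        self.initial = initial
        self.transitions = transitions

    def action(self, history):
        """Return the action played after a non-empty observation history."""
        state, act = self.initial, None
        for obs in history:
            state, act = self.transitions[(state, obs)]
        return act


def tabular_strategy(max_len):
    """Tabular form: one entry per observation history up to max_len."""
    table = {}
    for n in range(1, max_len + 1):
        for history in product(OBSERVATIONS, repeat=n):
            # Toy rule: play "right" once "door" has been seen, else "left".
            table[history] = "right" if "door" in history else "left"
    return table


# The same toy rule as a two-state automaton: q0 = door not yet seen, q1 = door seen.
automaton = MealyStrategy(
    initial="q0",
    transitions={
        ("q0", "wall"): ("q0", "left"),
        ("q0", "door"): ("q1", "right"),
        ("q1", "wall"): ("q1", "right"),
        ("q1", "door"): ("q1", "right"),
    },
)

table = tabular_strategy(max_len=6)
agrees = all(automaton.action(h) == a for h, a in table.items())

if __name__ == "__main__":
    print(f"table entries: {len(table)}, automaton transitions: {len(automaton.transitions)}")
    print(f"automaton agrees with table: {agrees}")
```

In this toy case the table grows exponentially with the history length, while the automaton stays at two states and four transitions, which is the kind of size gap the abstract refers to.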