Strategies for partially observable Markov decision processes (POMDPs) typically require memory. One way to represent this memory is via automata. We present a method to learn an automaton representation of a strategy using the L* algorithm. Compared to the tabular representation of a strategy, the resulting automaton is dramatically smaller and thus also more explainable. Moreover, the heuristics we apply during learning may even improve the strategy's performance. In contrast to approaches that synthesize an automaton directly from the POMDP, thereby solving it, our approach is far more scalable.
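To illustrate what an automaton representation of a finite-memory strategy can look like, here is a minimal sketch of a Mealy-machine controller in Python. It is not the authors' implementation and it does not perform L* learning. The class name `MealyStrategy`, the memory states, the observations, and the actions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MealyStrategy:
    """Hypothetical finite-memory strategy: automaton states act as memory."""

    initial: str
    # (memory_state, observation) -> (action, next_memory_state)
    transitions: dict = field(default_factory=dict)

    def run(self, observations):
        """Return the action chosen after each observation in the sequence."""
        state = self.initial
        actions = []
        for obs in observations:
            action, state = self.transitions[(state, obs)]
            actions.append(action)
        return actions


# Illustrative two-state controller (all names are assumptions).
strategy = MealyStrategy(
    initial="searching",
    transitions={
        ("searching", "wall"): ("turn", "searching"),
        ("searching", "clear"): ("forward", "searching"),
        ("searching", "goal_seen"): ("forward", "approaching"),
        ("approaching", "wall"): ("turn", "approaching"),
        ("approaching", "clear"): ("forward", "approaching"),
        ("approaching", "goal_seen"): ("stop", "approaching"),
    },
)

# The same observation yields different actions depending on memory.
actions = strategy.run(["clear", "goal_seen", "goal_seen"])
print(actions)
```

In this sketch the observation `goal_seen` leads to `forward` the first time and `stop` the second time, because the automaton has moved to a different memory state. A tabular strategy would instead need one entry per observation history, which is why a small automaton can be much more compact.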