Strategies for partially observable Markov decision processes (POMDPs) typically require memory. One way to represent this memory is via automata. We present a method to learn an automaton representation of a strategy using a modification of the L* algorithm. Compared to the tabular representation of a strategy, the resulting automaton is dramatically smaller and thus also more explainable. Moreover, our heuristics may even improve the strategy's performance during the learning process. In contrast to approaches that synthesize an automaton directly from the POMDP, thereby solving it, our approach is far more scalable.
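To make the contrast between the tabular and the automaton representation concrete, the following minimal sketch encodes one toy strategy in both forms. It is an illustration only and not the learning method of this work: the observations `a` and `b`, the actions `left` and `right`, the parity rule, and the names `MealyController` and `tabular_strategy` are all hypothetical, and no L*-style learning is performed.

```python
from itertools import product

OBS = ("a", "b")  # hypothetical observation alphabet


class MealyController:
    """Finite-state controller: reads observations, emits one action per step."""

    def __init__(self, initial, transitions):
        # transitions maps (state, observation) -> (next_state, action)
        self.initial = initial
        self.transitions = transitions

    def action(self, history):
        """Return the action chosen after a non-empty observation history."""
        state, act = self.initial, None
        for obs in history:
            state, act = self.transitions[(state, obs)]
        return act


def tabular_strategy(max_len):
    """Toy strategy as an explicit table from observation histories to actions.

    Assumed rule: play "right" iff an odd number of "b" observations was seen.
    """
    table = {}
    for n in range(1, max_len + 1):
        for hist in product(OBS, repeat=n):
            table[hist] = "right" if hist.count("b") % 2 == 1 else "left"
    return table


# The same strategy as a two-state automaton that tracks the parity of "b".
controller = MealyController(
    initial="even",
    transitions={
        ("even", "a"): ("even", "left"),
        ("even", "b"): ("odd", "right"),
        ("odd", "a"): ("odd", "right"),
        ("odd", "b"): ("even", "left"),
    },
)

table = tabular_strategy(max_len=6)
agrees = all(controller.action(h) == a for h, a in table.items())
num_states = len({state for state, _ in controller.transitions})
print(len(table), num_states, agrees)
```

In this toy case the table grows exponentially with the history length, while the automaton keeps two states and agrees with the table on every listed history. Real strategies need not compress this well.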