Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs). It scales to large state spaces by computing an approximation of the optimal policy locally and online, using a strategy based on Monte Carlo Tree Search. However, POMCP suffers from sparse reward functions, i.e., rewards that are obtained only when the final goal is reached, particularly in environments with large state spaces and long horizons. Recently, logic specifications have been integrated into POMCP to guide exploration and to satisfy safety requirements. However, such policy-related rules must be defined manually by domain experts, especially in real-world scenarios. In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions, i.e., sets of belief-action pairs generated by the planner. Specifically, we learn rules expressed in the paradigm of answer set programming. We then integrate them into POMCP to provide a soft policy bias toward promising actions. On two benchmark scenarios, rocksample and battery, we show that integrating rules learned from small task instances can improve performance with fewer Monte Carlo simulations and on larger task instances. We make our modified version of POMCP publicly available at https://github.com/GiuMaz/pomcp_clingo.git.
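To make the idea of a soft policy bias concrete, the following minimal Python sketch shows one possible way to favor rule-suggested actions during Monte Carlo simulations without forbidding the others. It is an illustration under stated assumptions, not the implementation in the linked repository: the function name `biased_rollout_action`, the `bias` parameter, and the action names are all hypothetical, and the set of suggested actions is assumed to have been computed beforehand (for example, by evaluating the learned answer set programming rules on the current belief).

```python
import random
from typing import Hashable, Optional, Sequence, Set

Action = Hashable


def biased_rollout_action(
    actions: Sequence[Action],
    suggested: Set[Action],
    bias: float = 0.8,
    rng: Optional[random.Random] = None,
) -> Action:
    """Pick a rollout action with a soft preference for rule-suggested ones.

    Hypothetical helper: with probability `bias` a suggested action is
    sampled uniformly; otherwise any legal action is sampled uniformly.
    Every legal action keeps a non-zero probability when bias < 1, which is
    what makes the bias "soft" rather than a hard constraint.
    """
    rng = rng or random  # fall back to the module-level generator
    preferred = [a for a in actions if a in suggested]
    if preferred and rng.random() < bias:
        return rng.choice(preferred)
    return rng.choice(list(actions))


if __name__ == "__main__":
    # Assumed, rocksample-like action names used only for illustration.
    legal = ["north", "south", "check", "sample"]
    print(biased_rollout_action(legal, {"sample"}, rng=random.Random(0)))
```

With four legal actions, one suggested action, and `bias = 0.8`, the suggested action is chosen with probability 0.8 + 0.2 × 0.25 = 0.85, while each of the remaining actions keeps probability 0.05.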