Partially Observable Markov Decision Processes (POMDPs) are a powerful framework for planning under uncertainty. They model state uncertainty as a belief, i.e., a probability distribution over states. Approximate solvers based on Monte Carlo sampling have been very successful at reducing the computational demand and enabling online planning. However, scaling to complex, realistic domains with many actions and long planning horizons remains a major challenge, and a key to good performance is guiding action selection with policy heuristics tailored to the specific application domain. We propose to learn high-quality heuristics from POMDP execution traces generated by any solver. We convert the belief-action pairs to a logical semantics and exploit data- and time-efficient Inductive Logic Programming (ILP) to generate interpretable belief-based policy specifications, which are then used as online heuristics. We thoroughly evaluate our methodology on two notoriously challenging POMDP problems involving large action spaces and long planning horizons, namely rocksample and pocman. Considering different state-of-the-art online POMDP solvers, including POMCP, DESPOT and AdaOPS, we show that learned heuristics expressed in Answer Set Programming (ASP) yield performance superior to neural networks and similar to optimal handcrafted task-specific heuristics, with lower computational time. Moreover, they generalize well to more challenging scenarios not experienced in the training phase (e.g., more rocks and larger grids in rocksample, larger maps and more aggressive ghosts in pocman).
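To make the step of converting belief-action pairs to a logical semantics more concrete, the following is a minimal sketch of one way it could look for a rocksample-like belief. It is not the authors' implementation: the predicate names `guess_geq` and `sample`, the percentage thresholds, and the per-rock belief representation are all assumptions chosen for illustration. Each belief is discretized into ASP-style ground facts, and each trace step becomes a (context, action) pair of the kind an ILP system could consume as a training example.

```python
from typing import Dict, List, Tuple

# Hypothetical belief representation: rock id -> probability that the rock is valuable.
Belief = Dict[int, float]


def belief_to_facts(belief: Belief,
                    thresholds: Tuple[int, ...] = (20, 50, 80)) -> List[str]:
    """Discretize a belief into ASP-style ground facts.

    The predicate name guess_geq/2 and the thresholds are illustrative
    assumptions, not taken from the paper.
    """
    facts = []
    for rock_id, prob in sorted(belief.items()):
        percent = round(prob * 100)
        for t in thresholds:
            # Emit one fact per threshold that the belief reaches.
            if percent >= t:
                facts.append(f"guess_geq(r{rock_id},{t}).")
    return facts


def trace_to_examples(trace: List[Tuple[Belief, str]]) -> List[Tuple[List[str], str]]:
    """Map each (belief, action) step of a trace to a (context facts, action atom) pair."""
    return [(belief_to_facts(belief), f"{action}.") for belief, action in trace]


# One step of a made-up trace: the agent samples rock 1 while fairly sure it is valuable.
examples = trace_to_examples([({1: 0.9, 2: 0.3}, "sample(r1)")])
for context, action in examples:
    print(action, "<-", " ".join(context))
```

Under these assumptions, an ILP learner could generalize many such pairs into a rule along the lines of "sample a rock when the belief that it is valuable is at least 80%", which is the sort of interpretable belief-based specification the abstract refers to.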