Partially Observable Markov Decision Processes (POMDPs) are powerful models for sequential decision making under transition and observation uncertainties. This paper studies the challenging yet important problem in POMDPs known as the (indefinite-horizon) Maximal Reachability Probability Problem (MRPP), where the goal is to maximize the probability of reaching some target states. This is also a core problem in model checking with logical specifications and is naturally undiscounted (discount factor is one). Inspired by the success of point-based methods developed for discounted problems, we study their extensions to MRPP. Specifically, we focus on trial-based heuristic search value iteration techniques and present a novel algorithm that leverages the strengths of these techniques for efficient exploration of the belief space (informed search via value bounds) while addressing their drawbacks in handling loops for indefinite-horizon problems. The algorithm produces policies with two-sided bounds on optimal reachability probabilities. We prove convergence to an optimal policy from below under certain conditions. Experimental evaluations on a suite of benchmarks show that our algorithm outperforms existing methods in almost all cases in both probability guarantees and computation time.
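For readers unfamiliar with MRPP, the undiscounted reachability objective mentioned above admits a standard fixed-point characterization over the belief space. The following is a minimal sketch under assumed notation (not from the abstract): beliefs $b$, actions $A$, observations $O$, Bayesian belief update $\tau(b,a,o)$, and a set of target beliefs $B_T$:

```latex
% Maximal reachability probability as a Bellman-style fixed point
% over the belief MDP (notation assumed for illustration):
V^*(b) =
\begin{cases}
  1 & \text{if } b \in B_T,\\[4pt]
  \displaystyle\max_{a \in A} \sum_{o \in O}
    \Pr(o \mid b, a)\, V^*\!\bigl(\tau(b,a,o)\bigr) & \text{otherwise.}
\end{cases}
```

HSVI-style trial-based methods of the kind the abstract builds on maintain bounds $L(b) \le V^*(b) \le U(b)$ and steer exploration toward beliefs with a large gap $U(b) - L(b)$; with a discount factor of one, loops in the belief space can stall the contraction that discounted analyses rely on, which is the difficulty the proposed algorithm targets.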