Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online. In this paper, we present a complete, provable theory of adaptive multilevel simplification for two settings: a given externally constructed belief tree, and MCTS that constructs the belief tree on the fly using an exploration technique. Our theory makes it possible to accelerate POMDP planning with belief-dependent rewards without any sacrifice in the quality of the obtained solution. We rigorously prove each theoretical claim in the proposed unified theory. Building on these general theoretical results, we present three algorithms that accelerate continuous POMDP online planning with belief-dependent rewards. Two of our algorithms, SITH-BSP and LAZY-SITH-BSP, can be utilized on top of any method that constructs a belief tree externally. The third, SITH-PFT, is an anytime MCTS method into which any exploration technique can be plugged. All our methods are guaranteed to return exactly the same optimal action as their unsimplified equivalents. We replace the costly computation of information-theoretic rewards with novel adaptive upper and lower bounds, which we derive in this paper and which are of independent interest. We show that these bounds are easy to calculate and can be tightened on demand by our algorithms. Our approach is general: any bounds that monotonically converge to the reward can be utilized to achieve a significant speedup without any loss in performance. Our theory and algorithms support the challenging setting of continuous states, actions, and observations. The beliefs can be parametric or general and represented by weighted particles. In simulation, we demonstrate a significant speedup in planning compared to baseline approaches, with guaranteed identical performance.
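To illustrate the idea behind bound-based action selection with guaranteed identical output, the following is a minimal sketch, not the paper's algorithm: each candidate action carries an interval bounding its (expensive) value, and intervals are adaptively tightened only until one action's lower bound dominates every rival's upper bound. All names, the `tighten` schedule, and the toy values are illustrative assumptions.

```python
def select_action(bounds, tighten):
    """Return the action whose value provably dominates all others.

    bounds:  dict action -> (lower, upper), initial bounds on its value
    tighten: function (action, lower, upper) -> (lower', upper'), returning
             strictly tighter bounds that converge to the true value
    """
    b = dict(bounds)
    while True:
        # Candidate: the action with the highest lower bound.
        best = max(b, key=lambda a: b[a][0])
        lo_best = b[best][0]
        # Rivals whose upper bound still overlaps the candidate's lower bound.
        rivals = [a for a in b if a != best and b[a][1] > lo_best]
        if not rivals:
            return best  # the same action an exact evaluation would pick
        # Adaptively refine only where needed: tighten the widest open interval.
        widest = max(rivals + [best], key=lambda a: b[a][1] - b[a][0])
        b[widest] = tighten(widest, *b[widest])

# Toy problem: the "true" (normally expensive) values are known here only
# so the illustrative tighten schedule has something to converge to.
true_value = {"a1": 1.0, "a2": 0.4, "a3": 0.7}

def tighten(a, lo, hi):
    v = true_value[a]
    # Shrink each side of the interval halfway toward the true value.
    return (lo + 0.5 * (v - lo), hi - 0.5 * (hi - v))

bounds = {a: (v - 1.0, v + 1.0) for a, v in true_value.items()}
print(select_action(bounds, tighten))  # → a1
```

Because the bounds converge monotonically to the true values, the loop always terminates with the exact argmax action, which mirrors the guarantee stated above: the simplified planner returns the same optimal action while evaluating the costly reward only to the accuracy the decision actually requires.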