Continuous POMDPs with general belief-dependent rewards are notoriously difficult to solve online. In this paper, we present a complete, provable theory of adaptive multilevel simplification for two settings: a given, externally constructed belief tree, and MCTS that constructs the belief tree on the fly using an exploration technique. Our theory makes it possible to accelerate POMDP planning with belief-dependent rewards without any sacrifice in the quality of the obtained solution. We rigorously prove each theoretical claim in the proposed unified theory. Using the general theoretical results, we present three algorithms that accelerate continuous POMDP online planning with belief-dependent rewards. Two of them, SITH-BSP and LAZY-SITH-BSP, can be used on top of any method that constructs a belief tree externally. The third, SITH-PFT, is an anytime MCTS method into which any exploration technique can be plugged. All our methods are guaranteed to return exactly the same optimal action as their unsimplified equivalents. We replace the costly computation of information-theoretic rewards with novel adaptive upper and lower bounds, which we derive in this paper and which are of independent interest. We show that these bounds are easy to calculate and can be tightened on demand by our algorithms. Our approach is general: any bounds that monotonically converge to the reward can be plugged in to achieve a significant speedup without any loss in performance. Our theory and algorithms support the challenging setting of continuous states, actions, and observations. The beliefs can be parametric, or general and represented by weighted particles. We demonstrate in simulation a significant speedup in planning compared to baseline approaches, with guaranteed identical performance.
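The core idea of the abstract above, choosing an action from bounds that are tightened only when needed, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the names `make_toy_bounds` and `select_action` are hypothetical, the bounds are toy bounds on a sum of terms known to lie in `[0, term_max]` rather than the information-theoretic bounds derived in the paper, and the selection is a flat one-step choice rather than a belief tree or MCTS. It is not the SITH-BSP, LAZY-SITH-BSP, or SITH-PFT algorithm.

```python
from typing import Callable, Dict, List, Tuple

Bounds = Tuple[float, float]  # (lower, upper)


def make_toy_bounds(terms: List[float], term_max: float) -> Callable[[int], Bounds]:
    """Toy stand-in for adaptive reward bounds (hypothetical, not the paper's).

    The exact value is sum(terms), with every term assumed to lie in
    [0, term_max]. Level k evaluates only the first k terms, so the interval
    shrinks monotonically and becomes exact at level len(terms).
    """
    def bounds(level: int) -> Bounds:
        k = min(level, len(terms))
        partial = sum(terms[:k])
        return partial, partial + (len(terms) - k) * term_max
    return bounds


def select_action(
    bound_fns: Dict[str, Callable[[int], Bounds]], max_level: int
) -> Tuple[str, Dict[str, int]]:
    """Return an optimal action, refining bounds only while intervals overlap.

    Assumes each bound function is exact (lower == upper) at max_level.
    """
    levels = {a: 0 for a in bound_fns}
    while True:
        current = {a: bound_fns[a](levels[a]) for a in bound_fns}
        # Candidate: the action with the largest lower bound.
        best = max(current, key=lambda a: current[a][0])
        # Rivals: actions whose upper bound still exceeds that lower bound.
        rivals = [a for a in current
                  if a != best and current[a][1] > current[best][0]]
        if not rivals:
            # No other action can beat `best`, so it is an exact maximizer.
            return best, levels
        refinable = [a for a in rivals + [best] if levels[a] < max_level]
        if not refinable:
            # Defensive exit; unreachable when bounds are exact at max_level.
            return best, levels
        # Tighten the widest interval among the actions still in contention.
        widest = max(refinable, key=lambda a: current[a][1] - current[a][0])
        levels[widest] += 1


terms = {
    "a1": [0.9, 0.8, 0.9, 0.7],
    "a2": [0.1, 0.2, 0.1, 0.3],
    "a3": [0.5, 0.5, 0.4, 0.6],
}
bound_fns = {a: make_toy_bounds(t, term_max=1.0) for a, t in terms.items()}
best, levels = select_action(bound_fns, max_level=4)
print(best, levels)
```

The loop stops as soon as the lower bound of one action is at least the upper bound of every other action, so the returned action is a maximizer of the exact values even if some bounds were never fully tightened. This is the sense in which simplification of this kind does not change the selected action.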