Solving continuous Partially Observable Markov Decision Processes (POMDPs) is challenging, particularly when the action space is continuous and high-dimensional. To alleviate this difficulty, we propose a new sampling-based online POMDP solver called Adaptive Discretization using Voronoi Trees (ADVT). ADVT combines Monte Carlo Tree Search with an adaptive discretization of the action space and optimistic optimization to efficiently sample high-dimensional continuous action spaces and compute the best action to perform. Specifically, for each sampled belief we adaptively discretize the action space using a hierarchical partition called a Voronoi tree: a Binary Space Partitioning that implicitly maintains the partition of a cell as the Voronoi diagram of two points sampled from that cell. ADVT uses the estimated diameter of each cell to form an upper-confidence bound on the action value function within the cell, and this bound guides both the Monte Carlo Tree Search expansion and the further discretization of the action space. As a result, ADVT exploits local information about the action value function better than existing solvers and identifies the most promising regions of the action space faster. Voronoi trees keep the cost of partitioning and of estimating each cell's diameter low, even in high-dimensional spaces where many sampled points are required to cover the space well. ADVT additionally handles continuous observation spaces by adopting an observation progressive widening strategy together with a weighted particle representation of beliefs. Experimental results indicate that ADVT scales substantially better to high-dimensional continuous action spaces than state-of-the-art methods.
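The Voronoi tree described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a box-shaped action space, naive rejection sampling inside a cell, a diameter estimate taken as the largest pairwise distance among sampled points, and one plausible additive form for the upper-confidence bound. All names (`VoronoiNode`, `locate`, `split`, `estimate_diameter`, `cell_ucb`) and the constants `c` and `c_diam` are hypothetical.

```python
import math
import random


class VoronoiNode:
    """One cell of a hypothetical Voronoi tree over a box-shaped action space."""

    def __init__(self, rep, parent=None):
        self.rep = rep        # representative action of this cell
        self.parent = parent
        self.children = []    # empty for a leaf, otherwise exactly two nodes

    def is_leaf(self):
        return not self.children


def locate(root, point):
    """Descend to the leaf whose cell contains `point`.

    Each internal cell is split by the Voronoi diagram of its two child
    representatives, so membership is a nearest-representative test.
    """
    node = root
    while not node.is_leaf():
        node = min(node.children, key=lambda c: math.dist(c.rep, point))
    return node


def sample_in_cell(root, leaf, low, high, rng, max_tries=10_000):
    """Rejection-sample a point of `leaf` from the box [low, high] (assumed scheme)."""
    for _ in range(max_tries):
        p = tuple(rng.uniform(lo, hi) for lo, hi in zip(low, high))
        if locate(root, p) is leaf:
            return p
    raise RuntimeError("cell too small for naive rejection sampling")


def split(root, leaf, low, high, rng):
    """Split `leaf` in two: one child keeps its representative, one gets a new sample."""
    new_rep = sample_in_cell(root, leaf, low, high, rng)
    leaf.children = [VoronoiNode(leaf.rep, leaf), VoronoiNode(new_rep, leaf)]
    return leaf.children


def estimate_diameter(root, leaf, low, high, rng, n=32):
    """Crude diameter estimate: largest pairwise distance among n sampled points."""
    pts = [sample_in_cell(root, leaf, low, high, rng) for _ in range(n)]
    return max(math.dist(p, q) for p in pts for q in pts)


def cell_ucb(q_hat, n_cell, n_total, diameter, c=1.0, c_diam=1.0):
    """Assumed bound: value estimate + visit-count bonus + diameter bonus."""
    return q_hat + c * math.sqrt(math.log(n_total) / n_cell) + c_diam * diameter
```

In this sketch a cell is never stored explicitly: it is defined only by the chain of nearest-representative tests from the root, which is what keeps splitting cheap as the dimension grows. The rejection sampler is the simplest choice rather than an efficient one, and a practical solver would need a sampling scheme that does not degrade as cells shrink.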