The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $\alpha$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $\alpha$-vectors at reachable belief points until convergence. However, since each $\alpha$-vector is $|S|$-dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP's value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called \emph{Neural Value Iteration}, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
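To make the PWLC representation concrete, here is a minimal sketch (not the paper's implementation) of how a POMDP value function stored as a finite set of $\alpha$-vectors is evaluated at a belief point: the value is the upper envelope $V(b) = \max_{\alpha} \langle \alpha, b \rangle$, which is exactly the operation that becomes costly when $|S|$ is large.

```python
import numpy as np

def pwlc_value(alpha_vectors: np.ndarray, belief: np.ndarray) -> float:
    """Evaluate a PWLC value function at a belief point.

    `alpha_vectors` is a (num_vectors, |S|) array; each row is one
    alpha-vector (a hyperplane over the belief simplex). The value at
    `belief` is the maximum over all inner products <alpha, b>.
    """
    return float(np.max(alpha_vectors @ belief))

# Toy 2-state example: two alpha-vectors and a uniform belief.
alphas = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
b = np.array([0.5, 0.5])
print(pwlc_value(alphas, b))  # -> 1.0 (the second vector dominates at b)
```

The same `max` structure is what motivates the paper's alternative: replacing each $|S|$-dimensional row of `alphas` with a neural network that maps a belief to a scalar, so the value remains a maximum over a finite set of function evaluations.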