Point-based Value Iteration for Neuro-Symbolic POMDPs

Neuro-symbolic artificial intelligence is an emerging area that combines traditional symbolic techniques with neural networks. In this paper, we consider its application to sequential decision making under uncertainty. We introduce neuro-symbolic partially observable Markov decision processes (NS-POMDPs), which model an agent that perceives a continuous-state environment using a neural network and makes decisions symbolically, and study the problem of optimising discounted cumulative rewards. This requires functions over continuous-state beliefs, for which we propose a novel piecewise linear and convex representation (P-PWLC) in terms of polyhedra covering the continuous-state space and value vectors, and extend Bellman backups to this representation. We prove the convexity and continuity of value functions and present two value iteration algorithms that ensure finite representability by exploiting the underlying structure of the continuous-state model and the neural perception mechanism. The first is a classical (exact) value iteration algorithm extending $\alpha$-functions of Porta et al. (2006) to the P-PWLC representation for continuous-state spaces. The second is a point-based (approximate) method called NS-HSVI, which uses the P-PWLC representation and belief-value induced functions to approximate value functions from below and above for two types of beliefs, particle-based and region-based. Using a prototype implementation, we show the practical applicability of our approach on two case studies that employ (trained) ReLU neural networks as perception functions, dynamic car parking and an aircraft collision avoidance system, by synthesising (approximately) optimal strategies. An experimental comparison with the finite-state POMDP solver SARSOP demonstrates that NS-HSVI is more robust to particle disturbances.
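
To illustrate the idea behind the P-PWLC representation, the following is a minimal sketch and not the prototype implementation. It assumes, as a simplification, that the polyhedra are axis-aligned boxes, that each $\alpha$-function is constant on each box, and that a particle-based belief is a finite list of weighted states; the names `Box`, `AlphaFunction` and `belief_value` are hypothetical. Under these assumptions, the value of a belief is the maximum over $\alpha$-functions of the weighted sum of their values at the particles.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Point = Tuple[float, ...]


@dataclass(frozen=True)
class Box:
    """Axis-aligned box, a simple special case of a polyhedron (assumption)."""
    lower: Point
    upper: Point

    def contains(self, s: Point) -> bool:
        # Half-open intervals so that adjacent boxes do not overlap.
        return all(lo <= x < hi for lo, x, hi in zip(self.lower, s, self.upper))


@dataclass(frozen=True)
class AlphaFunction:
    """Piecewise-constant function over a finite covering of the state space."""
    pieces: Tuple[Tuple[Box, float], ...]

    def __call__(self, s: Point) -> float:
        for box, value in self.pieces:
            if box.contains(s):
                return value
        raise ValueError(f"state {s} is not covered by any box")


def belief_value(alphas: Sequence[AlphaFunction],
                 particles: Sequence[Tuple[Point, float]]) -> float:
    """Evaluate max over alpha of sum_i w_i * alpha(s_i) for a particle belief."""
    return max(sum(w * alpha(s) for s, w in particles) for alpha in alphas)


# Hypothetical one-dimensional example: state space [0, 2) split into two boxes.
left, right = Box((0.0,), (1.0,)), Box((1.0,), (2.0,))
alpha_1 = AlphaFunction(((left, 1.0), (right, 0.0)))
alpha_2 = AlphaFunction(((left, 0.0), (right, 2.0)))

# Belief with 75% of its weight on a particle in the left box.
belief = [((0.5,), 0.75), ((1.5,), 0.25)]
value = belief_value([alpha_1, alpha_2], belief)  # alpha_1 attains the maximum
```

Because each $\alpha$-function is linear in the particle weights and the value is a maximum of such functions, the resulting function is convex in the belief, which is the property the lower bound relies on.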
