To plan safely in uncertain environments, agents must balance utility with safety constraints. Safe planning problems can be modeled as chance-constrained partially observable Markov decision processes (CC-POMDPs), and solutions often rely on expensive rollouts or heuristics to estimate the optimal value and action-selection policy. This work introduces ConstrainedZero, a policy iteration algorithm that solves CC-POMDPs in belief space by learning neural network approximations of the optimal value and policy, with an additional network head that estimates the failure probability given a belief. This failure probability guides safe action selection during online Monte Carlo tree search (MCTS). To avoid overemphasizing search based on the failure estimates, we introduce $\Delta$-MCTS, which uses adaptive conformal inference to update the failure threshold during planning. The approach is tested on a safety-critical POMDP benchmark, an aircraft collision avoidance system, and the sustainability problem of safe CO$_2$ storage. Results show that, by separating safety constraints from the objective, we can achieve a target level of safety without optimizing the balance between rewards and costs.
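As a rough illustration of the two ideas named above (a threshold adapted by conformal inference, and action selection gated by a failure estimate), the following Python sketch uses the generic adaptive conformal inference step $\alpha_{t+1} = \alpha_t + \gamma(\alpha - \mathrm{err}_t)$. It is not the $\Delta$-MCTS update itself: the function names, the step size `gamma`, the clipping to $[0, 1]$, the fallback rule, and the way the error indicator is supplied are all assumptions made for this example.

```python
def aci_update(alpha_t, alpha_target, err, gamma=0.05):
    """One generic adaptive conformal inference step (assumed form).

    alpha_t      : current adaptive level (used here as a failure threshold)
    alpha_target : desired long-run level
    err          : 1 if the last step counted as an error, else 0
    gamma        : step size (hypothetical default)
    """
    new_alpha = alpha_t + gamma * (alpha_target - err)
    # Clipping to [0, 1] is an assumption so the value stays a valid probability.
    return min(1.0, max(0.0, new_alpha))


def select_safe_action(q_values, fail_probs, threshold):
    """Pick the highest-value action whose estimated failure probability
    does not exceed the threshold.

    q_values, fail_probs : dicts keyed by action (hypothetical interface)
    If no action satisfies the threshold, fall back to the action with the
    lowest estimated failure probability (an assumed fallback rule).
    """
    safe = [a for a in q_values if fail_probs[a] <= threshold]
    if not safe:
        return min(fail_probs, key=fail_probs.get)
    return max(safe, key=q_values.get)


if __name__ == "__main__":
    q = {"a": 1.0, "b": 2.0, "c": 3.0}
    p_fail = {"a": 0.01, "b": 0.04, "c": 0.20}
    threshold = 0.05
    action = select_safe_action(q, p_fail, threshold)
    # Treat "chosen action's failure estimate exceeded the threshold" as the error signal.
    err = 1 if p_fail[action] > threshold else 0
    threshold = aci_update(threshold, 0.05, err)
    print(action, threshold)
```

In this sketch the threshold loosens slightly after each error-free step and tightens sharply after an error, so safety acts as a gate on the candidate actions rather than as a penalty term in the value being maximized.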