Correct-by-Construction Vision-based Pose Estimation using Geometric Generative Models

We consider the problem of vision-based pose estimation for autonomous systems. While deep neural networks have been used successfully for vision-based tasks, they inherently lack provable guarantees on the correctness of their output, and such guarantees are crucial for safety-critical applications. We present a framework for designing certifiable neural networks (NNs) for perception-based pose estimation that integrates physics-driven modeling with learning-based estimation. The framework starts from the known geometry of planar objects commonly found in the environment, such as traffic signs and runway markings, which we refer to as target objects. At its core is a geometric generative model (GGM), a neural-network-like model whose parameters are derived from the image formation process of a target object observed by a camera. Once designed, the GGM can be used to train NN-based pose estimators with certified guarantees on their estimation error. We first demonstrate the framework in uncluttered environments, where the target object is the only object in the camera's field of view. We then extend it, using ideas from NN reachability analysis, to design a certified object detector NN that detects the presence of the target object in cluttered environments. Finally, the framework combines the certified object detector with the certified pose estimator into a multi-stage perception pipeline that generalizes the approach to cluttered environments while maintaining its certified guarantees. We evaluate the framework on both synthetic and real images of various planar objects commonly encountered by autonomous vehicles. Using images captured by an event-based camera, we show that the trained encoder estimates the pose of a traffic sign in accordance with the certified bound provided by the framework.
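The abstract does not spell out the image formation model from which the GGM parameters are derived. As a rough illustration only, the sketch below assumes a standard distortion-free pinhole camera and projects the corners of a planar target into pixel coordinates for a given pose. For points on the target's own z = 0 plane, the projection reduces to a homography H = K [r1 r2 t]. The function name, the intrinsics `K`, and the example target dimensions are hypothetical and are not taken from the paper.

```python
import numpy as np


def project_planar_target(corners_xy, R, t, K):
    """Project points of a planar target (lying in its own z = 0 plane) to pixels.

    corners_xy : (N, 2) target-frame coordinates in metres (assumed units)
    R, t       : rotation (3, 3) and translation (3,) from target frame to camera frame
    K          : (3, 3) pinhole intrinsics (assumed, no lens distortion)
    """
    corners_xy = np.asarray(corners_xy, dtype=float)
    # With z = 0, the third column of R drops out: H = K [r1 r2 t].
    H = K @ np.column_stack((R[:, 0], R[:, 1], t))
    homog = np.column_stack((corners_xy, np.ones(len(corners_xy))))
    uvw = homog @ H.T
    if np.any(uvw[:, 2] <= 0):
        raise ValueError("target point is behind the camera")
    # Perspective division gives pixel coordinates.
    return uvw[:, :2] / uvw[:, 2:3]


# Hypothetical example: a 0.6 m square sign, 5 m in front of the camera, facing it.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
square = [[-0.3, -0.3], [0.3, -0.3], [0.3, 0.3], [-0.3, 0.3]]
pixels = project_planar_target(square, np.eye(3), np.array([0.0, 0.0, 5.0]), K)
print(pixels)
```

Because this mapping is an explicit function of the pose, it gives a sense of why a model built from image formation can be analysed in closed form. How the paper turns such a model into a neural-network-like GGM with certified error bounds is not described in the abstract and is not reproduced here.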
