3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning about and decoding object bounding boxes from multi-view camera input. In this work, we draw inspiration from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings derived from their 3D perspective geometry. We design view-conditioned queries at the output level, which enable the generation of multiple virtual frames during training so that the model learns viewpoint equivariance by enforcing multi-view consistency. The multi-view geometry, injected at the input level as positional encodings and regularized at the loss level, provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are available at https://github.com/TRI-ML/VEDet.
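To make the two geometric ingredients named in the abstract more concrete, the following is a minimal sketch, not the VEDet implementation. It assumes a pinhole camera model, a Fourier-feature encoding of per-pixel ray directions, and a virtual view described only by a yaw angle and a translation. The function names (`pixel_ray`, `fourier_encode`, `rot_z`, `to_virtual_view`) and the parameter `num_bands` are hypothetical and chosen for illustration; the actual encoding and view sampling used by the released code may differ.

```python
import math


def rot_z(theta):
    """Rotation matrix for a yaw of `theta` radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def pixel_ray(u, v, fx, fy, cx, cy, rotation):
    """Unit ray direction for pixel (u, v), expressed in the shared 3D frame.

    Assumes a pinhole camera with focal lengths (fx, fy), principal point
    (cx, cy), and a 3x3 camera-to-shared-frame `rotation`.
    """
    # Back-project the pixel: K^-1 [u, v, 1]^T for a pinhole intrinsic matrix K.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)
    # Rotate the direction from the camera frame into the shared frame.
    d = [sum(rotation[i][j] * d_cam[j] for j in range(3)) for i in range(3)]
    norm = math.sqrt(sum(c * c for c in d))
    return [c / norm for c in d]


def fourier_encode(vec, num_bands=4):
    """Sin/cos features at frequencies 2^k * pi for every component of `vec`."""
    feats = []
    for x in vec:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * x))
            feats.append(math.cos(freq * x))
    return feats


def to_virtual_view(point, yaw, translation):
    """Express a shared-frame point in a virtual view frame: p_v = R^T (p - t).

    The virtual view is assumed to have pose (R = rot_z(yaw), t = translation)
    in the shared frame.
    """
    r = rot_z(yaw)
    shifted = [point[j] - translation[j] for j in range(3)]
    # Multiplying by R^T means indexing R as r[j][i].
    return [sum(r[j][i] * shifted[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    # Geometric positional encoding for one pixel of a front-facing camera.
    ray = pixel_ray(400.0, 260.0, 500.0, 500.0, 320.0, 240.0, rot_z(0.0))
    encoding = fourier_encode(ray)
    print(len(encoding), "positional features for this pixel")

    # The same box center seen from a randomly chosen virtual view.
    center = [10.0, 2.0, 0.5]
    print(to_virtual_view(center, math.pi / 6, [1.0, -1.0, 0.0]))
```

Under these assumptions, a viewpoint-equivariant detector queried in a virtual frame would be expected to predict the box center returned by `to_virtual_view` rather than the original shared-frame center; a multi-view consistency loss can then compare predictions across several such virtual frames.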