3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning about and decoding object bounding boxes from multi-view camera input. In this work, we draw intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings derived from their 3D perspective geometry. We design view-conditioned queries at the output level, which enable the generation of multiple virtual frames during training so that viewpoint equivariance is learned by enforcing multi-view consistency. The multi-view geometry, injected at the input level as positional encodings and regularized at the loss level, provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are available at https://github.com/TRI-ML/VEDet.
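To make the two geometric ingredients above concrete, the following is a minimal NumPy sketch and not the VEDet implementation. It assumes that the perspective geometry of each pixel is represented by its viewing-ray direction and that this direction is embedded with a sinusoidal (Fourier) encoding; it also shows how ground-truth box centers could be re-expressed in a virtual view frame. The function names `pixel_rays`, `fourier_encode`, and `to_virtual_frame`, and the parameter `num_bands`, are hypothetical and chosen for illustration only.

```python
import numpy as np


def pixel_rays(K, cam_to_world, height, width):
    """Unit viewing-ray directions, in the world frame, for every pixel centre.

    K is a 3x3 pinhole intrinsics matrix and cam_to_world a 4x4 rigid transform.
    """
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ np.linalg.inv(K).T                   # back-project into the camera frame
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T        # rotate into the world frame
    return dirs_world / np.linalg.norm(dirs_world, axis=-1, keepdims=True)


def fourier_encode(x, num_bands=4):
    """Sinusoidal encoding of the last axis: D values -> 2 * num_bands * D values."""
    freqs = 2.0 ** np.arange(num_bands)                   # (B,) assumed frequency schedule
    angles = x[..., None] * freqs                         # (..., D, B)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)


def to_virtual_frame(points_world, world_to_view):
    """Express (N, 3) world-frame points in a virtual view frame (4x4 rigid transform)."""
    ones = np.ones((len(points_world), 1))
    homo = np.concatenate([points_world, ones], axis=1)   # (N, 4) homogeneous points
    return (homo @ world_to_view.T)[:, :3]
```

In this sketch, the encoded rays would be added to or concatenated with the image features, while `to_virtual_frame` produces the targets that a view-conditioned query would be asked to predict for a sampled virtual frame. Because the transform is rigid, the same objects keep their relative layout in every frame, which is the consistency that the training objective is meant to enforce.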