We present S-3D-RCNN, a new learning-based framework that recovers accurate object orientation in SO(3) and simultaneously predicts implicit rigid shapes from stereo RGB images. For orientation estimation, in contrast to previous studies that map local appearance to observation angles, we propose a progressive approach that extracts meaningful Intermediate Geometrical Representations (IGRs). This approach features a deep model that transforms perceived intensities from one or two views into object part coordinates, achieving direct egocentric orientation estimation in the camera coordinate system. To obtain a finer description inside the 3D bounding box, we investigate implicit shape estimation from stereo images. We model visible object surfaces with a point-based representation and augment the IGRs to explicitly address the unseen-surface hallucination problem. Extensive experiments validate the effectiveness of the proposed IGRs, and S-3D-RCNN achieves superior 3D scene understanding performance. We also design new metrics on the KITTI benchmark to evaluate implicit shape estimation.
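To give intuition for how object part coordinates can yield an egocentric orientation in SO(3), the sketch below aligns a canonical part layout to camera-frame part predictions with the closed-form orthogonal Procrustes (Kabsch) step. This is only an illustration of the underlying geometry: S-3D-RCNN's IGR-based head is a learned network, and the function and variable names here are hypothetical, not from the paper.

```python
import numpy as np

def rotation_from_parts(parts_pred: np.ndarray, parts_canonical: np.ndarray) -> np.ndarray:
    """Recover a rotation R in SO(3) aligning canonical part coordinates to
    predicted camera-frame part coordinates (Kabsch algorithm).

    Illustrative only: the paper's learned model regresses part coordinates;
    this closed-form step merely shows how such coordinates determine an
    egocentric orientation without any observation-angle parameterization.
    """
    # Center both point sets to remove translation.
    A = parts_canonical - parts_canonical.mean(axis=0)
    B = parts_pred - parts_pred.mean(axis=0)
    # Cross-covariance and its SVD.
    H = A.T @ B
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1 so R lies in SO(3), not O(3).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    return Vt.T @ D @ U.T  # maps canonical coordinates into the camera frame
```

Given at least three non-collinear part correspondences, the recovered rotation is exact for noise-free inputs and least-squares optimal under noise, which is why part-coordinate IGRs carry enough information for full SO(3) orientation rather than a single yaw angle.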