This paper introduces CN-RMA, a novel approach for 3D indoor object detection from multi-view images. We identify the key challenge as the ambiguity of image-to-3D correspondence in the absence of explicit geometry that provides occlusion information. To address this issue, CN-RMA leverages the synergy of 3D reconstruction networks and 3D object detection networks, where the reconstruction network provides a rough Truncated Signed Distance Function (TSDF) and guides image features to vote correctly into 3D space in an end-to-end manner. Specifically, we associate weights with the sampled points of each ray through ray marching, representing the contribution of an image pixel to the corresponding 3D locations. These weights are determined by the predicted signed distances, so that image features vote only to regions near the reconstructed surface. Our method achieves state-of-the-art performance in 3D object detection from multi-view images, as measured by [email protected] and [email protected] on the ScanNet and ARKitScenes datasets. The code and models are released at https://github.com/SerCharles/CN-RMA.
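The occlusion-aware voting described above can be illustrated with a minimal NumPy sketch. The specific weighting form here (a sigmoid opacity on the signed distance, composited with a transmittance term as in volume rendering) is our own assumption for illustration, not the paper's exact formula; the sharpness parameter `beta` is likewise hypothetical. The intent it captures is the one stated in the abstract: samples in free space get near-zero weight, the weight peaks at the first surface crossing, and samples behind the surface are attenuated by occlusion.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ray_marching_weights(sdf, beta=10.0):
    """Assign a voting weight to each sampled point along one ray.

    sdf: (N,) predicted signed distances at samples ordered from the
         camera outward (positive in free space, negative inside objects).
    beta: hypothetical sharpness parameter of the opacity transition.
    """
    # Opacity of each sample: ~0 in free space, ~1 inside the surface.
    occ = sigmoid(-beta * sdf)
    # Transmittance: probability the ray reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - occ[:-1]]))
    # Voting weight: probability that sample i is the first hit, so the
    # pixel's feature votes mainly to points near the reconstructed surface.
    return trans * occ

# A ray crossing a surface between the 3rd and 4th sample:
w = ray_marching_weights(np.array([0.5, 0.3, 0.1, -0.1, -0.3]))
```

In this example the weight peaks at the first sample with negative signed distance, i.e. just behind the zero-crossing of the TSDF, and the weights along a ray sum to at most one.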