NeRF-Det++: Incorporating Semantic Cues and Perspective-aware Depth Supervision for Indoor Multi-View 3D Detection

NeRF-Det has achieved impressive performance in indoor multi-view 3D detection by using NeRF to enhance representation learning. Despite this performance, we identify three key shortcomings in its current design: semantic ambiguity, inappropriate sampling, and insufficient use of depth supervision. To address these problems, we present three corresponding solutions: 1) Semantic Enhancement. We project the freely available 3D segmentation annotations onto the 2D plane and use the resulting 2D semantic maps as a supervision signal, significantly enhancing the semantic awareness of multi-view detectors. 2) Perspective-aware Sampling. Instead of a uniform sampling strategy, we propose a perspective-aware sampling policy that samples densely near the camera and sparsely in the distance, collecting valuable geometric clues more effectively. 3) Ordinal Residual Depth Supervision. Rather than directly regressing depth values, which is difficult to optimize, we divide the depth range of each scene into a fixed number of ordinal bins and reformulate depth prediction as the combination of depth-bin classification and residual depth regression, thereby benefiting the depth learning process. The resulting algorithm, NeRF-Det++, exhibits appealing performance on the ScanNetV2 and ARKitScenes datasets. Notably, on ScanNetV2, NeRF-Det++ outperforms the competitive NeRF-Det by +1.9% in mAP@.25 and +3.5% in mAP@.50. The code will be made publicly available at https://github.com/mrsempress/NeRF-Detplusplus.
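
To make the second and third ideas concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linearly increasing spacing rule for the perspective-aware samples and uniform-width bins with a width-normalized residual for the depth targets; the abstract specifies neither choice, and all function and parameter names are hypothetical.

```python
import math


def perspective_aware_depths(near, far, num_samples):
    """Depths from `near` to `far` whose spacing grows linearly with the index,
    so samples are dense close to the camera and sparse far away.
    ASSUMPTION: the linear-increasing rule is illustrative only."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    denom = num_samples * (num_samples - 1)
    return [near + (far - near) * i * (i + 1) / denom for i in range(num_samples)]


def encode_depth(depth, d_min, d_max, num_bins):
    """Split a depth value into (bin index, residual).
    ASSUMPTION: uniform bins; the residual is the offset from the bin centre,
    expressed in units of the bin width."""
    width = (d_max - d_min) / num_bins
    index = math.floor((depth - d_min) / width)
    index = max(0, min(index, num_bins - 1))  # clamp out-of-range depths
    centre = d_min + (index + 0.5) * width
    return index, (depth - centre) / width


def decode_depth(index, residual, d_min, d_max, num_bins):
    """Invert encode_depth: bin centre plus the de-normalized residual."""
    width = (d_max - d_min) / num_bins
    return d_min + (index + 0.5 + residual) * width


# Usage: five samples between 0.5 m and 8.5 m, and one depth target.
samples = perspective_aware_depths(0.5, 8.5, 5)
bin_index, residual = encode_depth(3.3, 0.0, 8.0, 8)
recovered = decode_depth(bin_index, residual, 0.0, 8.0, 8)
```

In a setup like this, a network would predict the bin index with a classification loss and the residual with a regression loss, and the two outputs would be recombined as in `decode_depth`.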
