Vision-based occupancy prediction, also known as 3D Semantic Scene Completion (SSC), presents a significant challenge in computer vision. Previous methods, confined to onboard processing, struggle with simultaneous geometric and semantic estimation, continuity across varying viewpoints, and single-view occlusion. This paper introduces OccFiner, a novel offboard framework designed to enhance the accuracy of vision-based occupancy predictions. OccFiner operates in two hybrid phases: 1) a multi-to-multi local propagation network that implicitly aligns and processes multiple local frames to correct onboard model errors and consistently enhance occupancy accuracy across all distances; and 2) a region-centric global propagation that refines labels using explicit multi-view geometry and integrates sensor bias, especially to increase the accuracy of distant occupied voxels. Extensive experiments demonstrate that OccFiner improves both geometric and semantic accuracy across various types of coarse occupancy, setting a new state of the art on the SemanticKITTI dataset. Notably, OccFiner elevates vision-based SSC models to a level that even surpasses that of LiDAR-based onboard SSC models.
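To make the idea of refining labels with explicit multi-view geometry more concrete, the following is a minimal illustrative sketch, not the OccFiner implementation. It assumes per-frame occupied voxel centers in ego coordinates, known 4x4 ego-to-world poses, and a simple inverse-distance weight as a hypothetical stand-in for sensor bias; the function name `fuse_occupancy` and all of its parameters are invented for this example.

```python
import numpy as np


def fuse_occupancy(points_per_frame, labels_per_frame, poses, voxel_size, num_classes):
    """Fuse per-frame semantic voxel predictions into one world-frame grid.

    points_per_frame: list of (N_i, 3) float arrays, voxel centers in ego coordinates.
    labels_per_frame: list of (N_i,) int arrays, predicted class per voxel.
    poses: list of (4, 4) ego-to-world transforms (assumed known, e.g. from odometry).
    Returns (keys, fused): unique integer world voxel indices and their voted labels.
    """
    keys_all, labels_all, weights_all = [], [], []
    for pts, lbl, pose in zip(points_per_frame, labels_per_frame, poses):
        # Hypothetical confidence model: observations made closer to the
        # camera are trusted more than distant ones.
        dist = np.linalg.norm(pts, axis=1)
        weights_all.append(1.0 / (1.0 + dist))
        # Explicit geometry: move ego-frame voxel centers into the world frame.
        world = pts @ pose[:3, :3].T + pose[:3, 3]
        keys_all.append(np.floor(world / voxel_size).astype(np.int64))
        labels_all.append(np.asarray(lbl, dtype=np.int64))

    keys = np.concatenate(keys_all)
    labels = np.concatenate(labels_all)
    weights = np.concatenate(weights_all)

    # Group observations that fall into the same world voxel.
    uniq, inv = np.unique(keys, axis=0, return_inverse=True)
    inv = inv.reshape(-1)

    # Weighted vote per voxel and class, then pick the winning class.
    votes = np.zeros((len(uniq), num_classes))
    np.add.at(votes, (inv, labels), weights)
    return uniq, votes.argmax(axis=1)
```

In this sketch, a voxel seen from far away in one frame and from nearby in another receives the label of the nearby observation, which mirrors the stated goal of improving distant occupied voxels by drawing on other viewpoints. The weighting scheme and the voting rule are assumptions made for illustration only.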