3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we resort to stereo matching technique and bird's-eye-view (BEV) representation learning to address such issues in SSC. Complementary to each other, stereo matching mitigates geometric ambiguity with epipolar constraint while BEV representation enhances the hallucination ability for invisible regions with global semantic context. However, due to the inherent representation gap between stereo geometry and BEV features, it is non-trivial to bridge them for dense prediction task of SSC. Therefore, we further develop a unified occupancy-based framework dubbed BRGScene, which effectively bridges these two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, is employed to encourage fine-grained interaction through mutual guidance. Besides, a Dual Volume Ensemble (DVE) module is introduced to facilitate complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available on https://github.com/Arlo0o/StereoScene.
翻译:三维语义场景补全(SSC)是一项病态感知任务,要求从有限观测中推断稠密的三维场景。已有基于相机的方法因固然的几何歧义性和不完整观测而难以预测精确的语义场景。本文利用立体匹配技术与鸟瞰视角(BEV)表示学习来解决SSC中的这些问题。立体匹配通过极线约束缓解几何歧义性,而BEV表示通过全局语义上下文增强对不可见区域的幻觉能力,两者互为补充。然而,由于立体几何与BEV特征之间存在固有的表示鸿沟,将其有效桥接用于SSC的稠密预测任务并非易事。为此,我们进一步提出统一的自占用框架BRGScene,该框架通过稠密三维体积有效桥接这两种表示以实现可靠的语义场景补全。具体而言,我们设计了一种新颖的互交互集成(MIE)模块,用于像素级可靠聚合立体几何与BEV特征。在MIE模块中,采用置信度重加权增强的双向可靠交互(BRI)模块,通过相互引导促进细粒度交互。此外,引入双体积集成(DVE)模块,通过通道级重校准与多组投票实现互补聚合。本方法在SemanticKITTI数据集的语义场景补全任务中优于所有已发表的基于相机的方法。代码已开源:https://github.com/Arlo0o/StereoScene。