Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes because of inherent geometric ambiguity and incomplete observations. In this paper, we turn to stereo matching and bird's-eye-view (BEV) representation learning to address these issues in SSC. The two are complementary: stereo matching mitigates geometric ambiguity through the epipolar constraint, while the BEV representation uses global semantic context to strengthen the hallucination of invisible regions. However, because of the inherent representation gap between stereo geometry and BEV features, bridging them for the dense prediction task of SSC is non-trivial. We therefore develop a unified occupancy-based framework, dubbed BRGScene, which bridges the two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for pixel-level reliable aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, encourages fine-grained interaction through mutual guidance. In addition, a Dual Volume Ensemble (DVE) module facilitates complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available at \url{https://github.com/Arlo0o/StereoScene}.
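
As a rough illustration of what confidence re-weighting and channel-wise recalibration could look like when two volumes are aggregated, the sketch below fuses a stereo volume and a BEV volume voxel by voxel. It is not the BRGScene implementation: the names `channel_gates` and `fuse_volumes`, the parameter-free sigmoid gate (a learned gate would normally be used), and the softmax over two per-voxel confidence scores are all assumptions made for this example. Volumes are flattened to plain lists of shape `[voxels][channels]` so the code runs with the standard library only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_gates(volume):
    """Hypothetical channel recalibration: average each channel over all
    voxels, then squash the mean into (0, 1). No learned weights are used."""
    n_vox = len(volume)
    n_ch = len(volume[0])
    means = [sum(voxel[c] for voxel in volume) / n_vox for c in range(n_ch)]
    return [sigmoid(m) for m in means]


def fuse_volumes(stereo, bev, stereo_conf, bev_conf):
    """Fuse two [voxels][channels] volumes.

    stereo_conf and bev_conf hold one confidence score per voxel (assumed
    inputs). A softmax over the two scores gives per-voxel mixing weights."""
    gates_s = channel_gates(stereo)
    gates_b = channel_gates(bev)
    fused = []
    for v_s, v_b, c_s, c_b in zip(stereo, bev, stereo_conf, bev_conf):
        # Two-way softmax: the more confident branch gets the larger weight.
        e_s, e_b = math.exp(c_s), math.exp(c_b)
        w_s = e_s / (e_s + e_b)
        w_b = 1.0 - w_s
        fused.append([
            w_s * g_s * a + w_b * g_b * b
            for g_s, a, g_b, b in zip(gates_s, v_s, gates_b, v_b)
        ])
    return fused


if __name__ == "__main__":
    stereo = [[1.0, 0.0], [1.0, 0.0]]
    bev = [[0.0, 2.0], [0.0, 2.0]]
    # First voxel trusts stereo, second voxel trusts BEV.
    print(fuse_volumes(stereo, bev, [5.0, -5.0], [-5.0, 5.0]))
```

With equal confidences the two branches contribute equally; as one confidence grows relative to the other, the fused voxel approaches the gated features of the more confident branch.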
