Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion

3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations. Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations. In this paper, we turn to stereo matching and bird's-eye-view (BEV) representation learning to address these issues. The two are complementary: stereo matching mitigates geometric ambiguity through the epipolar constraint, while the BEV representation uses global semantic context to strengthen the hallucination of invisible regions. However, because of the inherent representation gap between stereo geometry and BEV features, bridging them for the dense prediction task of SSC is non-trivial. We therefore develop a unified occupancy-based framework, dubbed BRGScene, which bridges the two representations with dense 3D volumes for reliable semantic scene completion. Specifically, we design a novel Mutual Interactive Ensemble (MIE) block for reliable pixel-level aggregation of stereo geometry and BEV features. Within the MIE block, a Bi-directional Reliable Interaction (BRI) module, enhanced with confidence re-weighting, encourages fine-grained interaction through mutual guidance. In addition, a Dual Volume Ensemble (DVE) module facilitates complementary aggregation through channel-wise recalibration and multi-group voting. Our method outperforms all published camera-based methods on SemanticKITTI for semantic scene completion. Our code is available at \url{https://github.com/Arlo0o/StereoScene}.
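
To make the two ideas named above more concrete, the following is a minimal NumPy sketch of confidence re-weighted mutual guidance between two dense volumes, followed by channel-wise recalibration before they are summed. It is an illustration under stated assumptions, not the BRGScene implementation: the function names (`mutual_interaction`, `channel_recalibrate`, `ensemble`), the `(C, X, Y, Z)` volume layout, and the use of fixed sigmoid gates in place of learned layers are all hypothetical, and multi-group voting is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def mutual_interaction(stereo, bev):
    """Refine each volume with the other, scaled by the other's confidence.

    Both inputs are assumed to have shape (C, X, Y, Z).
    """
    # Assumption: confidence is a sigmoid of the channel mean, one value per
    # voxel, shape (1, X, Y, Z). A real model would learn this mapping.
    conf_stereo = sigmoid(stereo.mean(axis=0, keepdims=True))
    conf_bev = sigmoid(bev.mean(axis=0, keepdims=True))
    # Bi-directional guidance: each branch receives the other branch's
    # features, weighted by how confident that other branch is.
    stereo_out = stereo + conf_bev * bev
    bev_out = bev + conf_stereo * stereo
    return stereo_out, bev_out


def channel_recalibrate(vol):
    """Scale each channel by a gate computed from its global average."""
    # Global average pool over the spatial axes gives shape (C, 1, 1, 1).
    gate = sigmoid(vol.mean(axis=(1, 2, 3), keepdims=True))
    return vol * gate


def ensemble(stereo, bev):
    """Interact the two volumes, recalibrate each, and sum them."""
    s, b = mutual_interaction(stereo, bev)
    return channel_recalibrate(s) + channel_recalibrate(b)


rng = np.random.default_rng(0)
stereo_volume = rng.standard_normal((8, 4, 4, 2))
bev_volume = rng.standard_normal((8, 4, 4, 2))
fused = ensemble(stereo_volume, bev_volume)
print(fused.shape)
```

In this sketch the fused volume keeps the input shape, so it could be passed to a per-voxel classification head. The gates here are parameter-free only to keep the example self-contained.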
