Monocular 3D Semantic Scene Completion (SSC) is a challenging yet promising task that aims to infer dense geometric and semantic descriptions of a scene from a single image. While recent object-centric paradigms significantly improve efficiency by leveraging flexible 3D Gaussian primitives, they still rely heavily on a large number of randomly initialized primitives, which inevitably leads to 1) inefficient primitive initialization and 2) outlier primitives that introduce erroneous artifacts. In this paper, we propose SplatSSC, a novel framework that resolves these limitations with a depth-guided initialization strategy and a principled Gaussian aggregator. Instead of random initialization, SplatSSC utilizes a dedicated depth branch composed of a Group-wise Multi-scale Fusion (GMF) module, which integrates multi-scale image and depth features to generate a sparse yet representative set of initial Gaussian primitives. To mitigate noise from outlier primitives, we develop the Decoupled Gaussian Aggregator (DGA), which enhances robustness by decomposing geometric and semantic predictions during the Gaussian-to-voxel splatting process. Complemented with a specialized Probability Scale Loss, our method achieves state-of-the-art performance on the Occ-ScanNet dataset, outperforming prior approaches by over 6.3% in IoU and 4.1% in mIoU, while reducing both latency and memory cost by more than 9.3%.
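To make the decoupled aggregation idea concrete, the sketch below illustrates one way geometry and semantics can be split during Gaussian-to-voxel splatting: occupancy is accumulated as a probabilistic union of per-primitive contributions, while semantics are an occupancy-weighted average of per-primitive class distributions. This is a minimal NumPy illustration under assumed simplifications (isotropic Gaussians, hand-picked aggregation rules), not the paper's actual DGA implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def splat_decoupled(centers, scales, occ_logits, sem_logits, voxels):
    """Illustrative decoupled Gaussian-to-voxel splatting (not the official DGA).

    centers:    (G, 3) primitive means
    scales:     (G,)   isotropic standard deviations (assumption: isotropic)
    occ_logits: (G,)   per-primitive occupancy logits
    sem_logits: (G, C) per-primitive semantic class logits
    voxels:     (V, 3) voxel-center coordinates
    Returns per-voxel occupancy (V,) and class probabilities (V, C).
    """
    # Gaussian weight of each primitive at each voxel center
    d2 = ((voxels[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (V, G)
    w = np.exp(-0.5 * d2 / scales[None, :] ** 2)                    # (V, G)

    # Geometry branch: probabilistic union of per-primitive occupancies,
    # so no single primitive's semantics can corrupt the geometry estimate
    alpha = sigmoid(occ_logits)[None, :] * w                        # (V, G)
    occ = 1.0 - np.prod(1.0 - alpha, axis=1)                        # (V,)

    # Semantics branch: softmax each primitive's class logits, then take an
    # occupancy-weighted average per voxel (decoupled from the union above)
    probs = np.exp(sem_logits - sem_logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)                           # (G, C)
    sem = (alpha[..., None] * probs[None, :, :]).sum(1)
    sem /= alpha.sum(1, keepdims=True) + 1e-6                       # (V, C)
    return occ, sem
```

Because the two branches share only the raw Gaussian weights, a primitive with a confident but wrong class prediction shifts the semantic average without inflating the occupancy estimate, which mirrors the robustness motivation stated above.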