Semantic-aware 3D reconstruction from sparse, unposed images remains challenging for feed-forward 3D Gaussian Splatting (3DGS). Existing methods often predict an over-complete set of Gaussian primitives under sparse-view supervision, leading to unstable geometry and inferior depth quality. Meanwhile, they rely solely on 2D segmenter features for semantic lifting, which provides weak 3D-level supervision with limited generalization, resulting in incomplete 3D semantics in novel scenes. To address these issues, we propose UniSem, a unified framework that jointly improves depth accuracy and semantic generalization via two key components. First, Error-aware Gaussian Dropout (EGD) performs error-guided capacity control, using rendering-error cues to suppress redundancy-prone Gaussians and produce compact, geometrically stable Gaussian representations for improved depth estimation. Second, we introduce a Mix-training Curriculum (MTC) that progressively blends 2D segmenter-lifted semantics with the model's own emergent 3D semantic priors, implemented with object-level prototype alignment to enhance semantic coherence and completeness. Extensive experiments on ScanNet and Replica show that UniSem achieves superior performance in depth prediction and open-vocabulary 3D segmentation across varying numbers of input views. Notably, with 16-view inputs, UniSem reduces depth relative error (Rel) by 15.2% and improves open-vocabulary segmentation mean accuracy (mAcc) by 3.7% over strong baselines.
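To make the EGD idea concrete, below is a minimal, hypothetical sketch of error-guided Gaussian dropout. It assumes the rasterizer exposes each Gaussian's per-pixel contribution weights (`contrib`) so that per-pixel rendering error can be accumulated back onto individual Gaussians; the function name, signature, and the sigmoid mapping from error score to dropout probability are illustrative choices, not the paper's exact formulation.

```python
import torch

def error_aware_gaussian_dropout(contrib, pixel_err, keep_floor=0.1, tau=1.0):
    """Suppress Gaussians whose contributions concentrate on high-error pixels.

    contrib:   (G, P) non-negative contribution weight of Gaussian g to pixel p.
    pixel_err: (P,)   per-pixel rendering error, e.g. |rendered - gt| averaged
                      over color channels.
    Returns a boolean keep-mask of shape (G,).
    """
    # Accumulate rendering error onto each Gaussian, normalized by its
    # total contribution so large splats are not penalized by size alone.
    gauss_err = (contrib * pixel_err[None, :]).sum(dim=1) / (contrib.sum(dim=1) + 1e-8)
    # Map the centered error score to a dropout probability with a
    # temperature-scaled sigmoid (tau controls how sharp the cutoff is).
    drop_prob = torch.sigmoid((gauss_err - gauss_err.mean()) / tau)
    drop_prob = drop_prob.clamp(max=1.0 - keep_floor)  # never drop everything
    keep = torch.rand_like(drop_prob) > drop_prob
    return keep

# Toy usage: 1000 Gaussians rendered to a 64x64 image (4096 pixels).
contrib = torch.rand(1000, 4096)
pixel_err = torch.rand(4096)
mask = error_aware_gaussian_dropout(contrib, pixel_err)
print(f"kept {mask.sum().item()} / {mask.numel()} Gaussians")
```

Dropping high-error, redundancy-prone Gaussians in this stochastic fashion (rather than hard-pruning them) keeps gradients flowing to the survivors, which is consistent with the abstract's framing of EGD as capacity control rather than permanent pruning.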
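Similarly, a rough sketch of what the MTC blending and object-level prototype alignment could look like is given below. It assumes a linear ramp schedule from 2D segmenter-lifted targets toward the model's own 3D semantic priors, and mean-pooled per-object prototypes compared with cosine similarity; all names (`mix_weight`, `object_ids`, the warmup fraction) are hypothetical, as the abstract does not specify the schedule or the alignment objective.

```python
import torch
import torch.nn.functional as F

def mix_weight(step, total_steps, warmup=0.2):
    """Curriculum weight in [0, 1]: rely on 2D-lifted semantics early,
    then progressively shift toward the model's emergent 3D priors."""
    t = (max(0.0, step / total_steps - warmup)) / (1.0 - warmup)
    return min(1.0, t)

def prototype_alignment_loss(pred_sem, feat_2d, feat_3d, object_ids, lam):
    """pred_sem, feat_2d, feat_3d: (N, C) per-point semantic features;
    object_ids: (N,) integer object labels; lam: curriculum weight."""
    # Blend the two supervision sources according to the curriculum.
    target = (1.0 - lam) * feat_2d + lam * feat_3d
    loss = 0.0
    ids = object_ids.unique()
    for oid in ids:
        m = object_ids == oid
        # Mean-pool each object's features into a prototype and align
        # predictions to the blended target at the object level.
        proto_pred = pred_sem[m].mean(dim=0)
        proto_tgt = target[m].mean(dim=0)
        loss = loss + (1.0 - F.cosine_similarity(proto_pred, proto_tgt, dim=0))
    return loss / ids.numel()

# Toy usage: 500 points, 32-dim semantic features, 8 objects, step 6k of 10k.
pred = torch.randn(500, 32)
f2d, f3d = torch.randn(500, 32), torch.randn(500, 32)
obj = torch.randint(0, 8, (500,))
print(prototype_alignment_loss(pred, f2d, f3d, obj, mix_weight(6000, 10000)))
```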