3D semantic field learning is crucial for applications such as autonomous navigation, AR/VR, and robotics, where accurate comprehension of 3D scenes from limited viewpoints is essential. Existing methods struggle under sparse-view conditions, relying on inefficient per-scene multi-view optimization that is impractical for many real-world tasks. To address this, we propose SLGaussian, a feed-forward method for constructing 3D semantic fields from sparse viewpoints, enabling direct inference of 3D Gaussian Splatting (3DGS)-based scenes. By enforcing consistent SAM segmentations through video tracking and indexing high-dimensional CLIP features with low-dimensional codes, SLGaussian efficiently embeds language information in 3D space, offering a robust solution for accurate 3D scene understanding under sparse-view conditions. In experiments on two-view sparse 3D object querying and segmentation on the LERF and 3D-OVS datasets, SLGaussian outperforms existing methods in chosen IoU, Localization Accuracy, and mIoU. Moreover, our model achieves scene inference in under 30 seconds and open-vocabulary querying in just 0.011 seconds per query.
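The low-dimensional indexing mentioned above can be illustrated with a minimal sketch: instead of attaching a full high-dimensional CLIP feature to every Gaussian, each Gaussian stores only a compact segment index into a shared codebook of CLIP features, and open-vocabulary queries compare the text embedding against the codebook. All names, shapes, and the codebook/query API below are illustrative assumptions, and the feature values are synthetic stand-ins rather than real CLIP embeddings.

```python
import numpy as np

# Hedged sketch, not the paper's implementation: one CLIP-like feature is
# stored per segment in a codebook, while each 3D Gaussian carries only a
# low-dimensional index into that codebook. Features here are random
# synthetic vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)

num_segments, clip_dim = 4, 512
codebook = rng.normal(size=(num_segments, clip_dim))          # one feature per segment
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)   # unit-normalize for cosine similarity

num_gaussians = 10_000
gaussian_index = rng.integers(0, num_segments, size=num_gaussians)  # compact per-Gaussian index

def query(text_embedding: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean mask over Gaussians whose segment matches the text query."""
    text_embedding = text_embedding / np.linalg.norm(text_embedding)
    sims = codebook @ text_embedding          # cosine similarity per codebook entry
    matching_segments = sims > threshold      # segments relevant to the query
    return matching_segments[gaussian_index]  # broadcast the decision to all Gaussians

# Querying with a codebook entry itself selects exactly that segment's Gaussians.
mask = query(codebook[2])
```

The design point is that query cost scales with the small codebook, not the number of Gaussians: similarity is computed once per segment and then broadcast through the index array, which is consistent with the sub-millisecond per-query times reported above.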