Recently, several studies have combined Gaussian Splatting with language embeddings to obtain scene representations for open-vocabulary 3D scene understanding. While these methods perform well, they essentially require very dense multi-view inputs, limiting their applicability in real-world scenarios. In this work, we propose SparseLGS to address the challenge of 3D scene understanding with pose-free and sparse view input images. Our method leverages a learning-based dense stereo model to handle pose-free and sparse inputs, and a three-step region matching approach to address the multi-view semantic inconsistency problem, which is especially important for sparse inputs. Rather than directly learning high-dimensional CLIP features, we extract low-dimensional information and build bijections to avoid excessive learning and storage costs. We introduce a reconstruction loss during semantic training to improve Gaussian positions and shapes. To the best of our knowledge, we are the first to address the 3D semantic field problem with sparse pose-free inputs. Experimental results show that SparseLGS achieves comparable quality when reconstructing semantic fields with fewer inputs (3-4 views) compared to previous SOTA methods with dense input. Moreover, when using the same sparse input, SparseLGS significantly outperforms these methods in quality and substantially improves the computation speed (5$\times$ speedup). Project page: https://ustc3dv.github.io/SparseLGS