LangSplat: 3D Language Gaussian Splatting

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field that supports open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By rendering language features with a tile-based splatting technique, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods produce imprecise and vague 3D language fields that fail to discern clear boundaries between objects. We examine this issue and propose to learn hierarchical semantics using SAM, which eliminates both the need to query the language field extensively across various scales and the need for DINO feature regularization. Extensive experiments on open-vocabulary 3D object localization and semantic segmentation demonstrate that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a {\speed} $\times$ speedup compared to LERF at a resolution of 1440 $\times$ 1080. We strongly recommend that readers view our video results at https://langsplat.github.io

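The scene-wise language autoencoder mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: a PCA-style linear encoder/decoder stands in for the learned autoencoder, the features are synthetic rather than real CLIP embeddings, and the dimensions (512 for the feature space, 3 for the latent space) as well as every function name are assumptions chosen for illustration. The point is only that features from a single scene can be stored in a much smaller scene-specific latent space and decoded back when needed.

```python
import numpy as np


def fit_linear_autoencoder(features, latent_dim):
    """Fit a PCA-style linear encoder/decoder to one scene's features.

    Hypothetical stand-in for a learned scene-wise autoencoder.
    """
    mean = features.mean(axis=0)
    # Rows of vt are the principal directions of the centered features.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    basis = vt[:latent_dim]  # shape: (latent_dim, feature_dim)
    return mean, basis


def encode(features, mean, basis):
    """Project high-dimensional features into the scene-specific latent space."""
    return (features - mean) @ basis.T


def decode(latents, mean, basis):
    """Map latent features back to the original feature space."""
    return latents @ basis + mean


rng = np.random.default_rng(0)
feature_dim, latent_dim, num_features = 512, 3, 1000  # assumed sizes

# Synthetic stand-in for the language features of one scene. The features
# are built to lie near a low-dimensional subspace, which is the property
# that makes scene-specific compression work in this toy setting.
hidden = rng.normal(size=(num_features, latent_dim))
mixing = rng.normal(size=(latent_dim, feature_dim))
noise = 0.01 * rng.normal(size=(num_features, feature_dim))
scene_features = hidden @ mixing + noise

mean, basis = fit_linear_autoencoder(scene_features, latent_dim)
latents = encode(scene_features, mean, basis)    # what each Gaussian would store
reconstructed = decode(latents, mean, basis)     # recovered at query time

relative_error = (
    np.linalg.norm(reconstructed - scene_features)
    / np.linalg.norm(scene_features)
)
compression_ratio = feature_dim / latent_dim

print(f"latent shape: {latents.shape}")
print(f"relative reconstruction error: {relative_error:.4f}")
print(f"per-feature compression ratio: {compression_ratio:.1f}x")
```

In this toy setting each stored feature shrinks from 512 values to 3, which mirrors the memory argument in the abstract. Real CLIP features are not exactly low-rank, so a learned nonlinear autoencoder trained per scene would replace the PCA step.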