LangSplat: 3D Language Gaussian Splatting

Humans live in a 3D world and commonly use natural language to interact with 3D scenes. Modeling a 3D language field that supports open-ended language queries in 3D has recently gained increasing attention. This paper introduces LangSplat, which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. Unlike existing methods that ground CLIP language embeddings in a NeRF model, LangSplat represents the language field with a collection of 3D Gaussians, each encoding language features distilled from CLIP. By employing a tile-based splatting technique to render language features, we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings, LangSplat first trains a scene-wise language autoencoder and then learns language features in the scene-specific latent space, thereby alleviating the substantial memory demands imposed by explicit modeling. Existing methods suffer from imprecise and vague 3D language fields, which fail to discern clear boundaries between objects. We examine this issue and propose to learn hierarchical semantics using SAM, thereby eliminating the need to query the language field extensively across scales and to regularize it with DINO features. Extensive experimental results show that LangSplat outperforms the previous state-of-the-art method, LERF, by a large margin. Notably, LangSplat is extremely efficient, achieving a 199 $\times$ speedup over LERF at a resolution of 1440 $\times$ 1080. We strongly encourage readers to view our video results at https://langsplat.github.io/
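
As a rough illustration of how splatting renders language features, the sketch below composites the features of the Gaussians covering a single pixel. It assumes the standard front-to-back alpha-blending rule of 3D Gaussian Splatting, applied to low-dimensional latent features rather than colors. The function name `composite_features`, the 3-dimensional toy features, and the opacity values are hypothetical and chosen only for illustration; an actual tile-based rasterizer runs this accumulation in parallel on the GPU for every pixel of every tile.

```python
from typing import List, Sequence


def composite_features(
    features: Sequence[Sequence[float]],
    alphas: Sequence[float],
) -> List[float]:
    """Blend per-Gaussian feature vectors for one pixel, front to back.

    features: latent language features of the Gaussians overlapping the
              pixel, sorted from nearest to farthest (hypothetical input).
    alphas:   effective opacity of each Gaussian at this pixel, in [0, 1].
    """
    if len(features) != len(alphas):
        raise ValueError("features and alphas must have the same length")

    dim = len(features[0]) if features else 0
    out = [0.0] * dim
    transmittance = 1.0  # fraction of light not yet absorbed by nearer Gaussians

    for feature, alpha in zip(features, alphas):
        weight = alpha * transmittance  # contribution of this Gaussian
        for k in range(dim):
            out[k] += weight * feature[k]
        transmittance *= 1.0 - alpha  # attenuate for the Gaussians behind it

    return out


# Two Gaussians with toy 3-dimensional latent features, each half opaque.
pixel = composite_features([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [0.5, 0.5])
print(pixel)
```

Each Gaussian contributes in proportion to its opacity and to the transmittance left over by the Gaussians in front of it, so the rendered feature needs one pass over a sorted list instead of many network evaluations along a ray.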
