Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precise and efficient open-vocabulary querying in static 3D scenes, it cannot handle dynamic 4D fields because CLIP, designed for static image-text tasks, fails to capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field requires pixel-aligned, object-wise video features, which current vision models struggle to produce. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to efficiently handle both time-agnostic and time-sensitive open-vocabulary queries in dynamic scenes. Rather than learning the language field from vision features, 4D LangSplat learns directly from text: object-wise video captions generated by Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, combining visual and text prompts that guide MLLMs to produce detailed, temporally consistent, high-quality captions for each object throughout a video. These captions are encoded by a Large Language Model into sentence embeddings, which serve as pixel-aligned, object-specific feature supervision and enable open-vocabulary text queries through a shared embedding space. Recognizing that objects in 4D scenes transition smoothly between states, we further propose a status deformable network to model these continuous changes over time. Across multiple benchmarks, 4D LangSplat delivers precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
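To make the caption-to-embedding pipeline concrete, here is a minimal sketch of how object-wise captions could be encoded into a shared embedding space and matched against an open-vocabulary text query. The encoder choice (`all-MiniLM-L6-v2` via the sentence-transformers library) and the example captions are illustrative assumptions, not the paper's actual model or data.

```python
# Minimal sketch (not the paper's implementation): encode object-wise video
# captions into sentence embeddings and rank them against an open-vocabulary
# text query via cosine similarity in the shared embedding space.
# The encoder (all-MiniLM-L6-v2) is an assumed stand-in for the paper's LLM encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical MLLM-generated captions for one object at three timestamps.
captions = [
    "a dog lying still on the grass",
    "a dog standing up and stretching",
    "a dog running across the lawn",
]
caption_embs = encoder.encode(captions, convert_to_tensor=True)

# A time-sensitive open-vocabulary query, encoded into the same space.
query_emb = encoder.encode("the dog starts to run", convert_to_tensor=True)

# Cosine similarity selects the timestamp whose caption best matches the query.
scores = util.cos_sim(query_emb, caption_embs)[0]
best = scores.argmax().item()
print(f"best match: {captions[best]} (score={scores[best]:.3f})")
```

Because both captions and queries live in the same sentence-embedding space, the same similarity test supports time-agnostic queries (matching any timestamp) and time-sensitive ones (matching a specific state).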
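The status deformable network is described only at a high level here; the sketch below shows one plausible reading, assuming an object's language feature at time t is a softmax-weighted blend of a small set of learned state embeddings, which yields smooth transitions between discrete states. The class name, dimensions, and the tiny time-MLP are all hypothetical, introduced purely for illustration.

```python
# Illustrative sketch (under stated assumptions, not the paper's architecture):
# blend K learned per-object state embeddings with time-dependent softmax
# weights so the object's feature evolves smoothly over time.
import torch
import torch.nn as nn


class StatusDeformableNet(nn.Module):
    def __init__(self, num_states: int = 4, embed_dim: int = 384, hidden: int = 64):
        super().__init__()
        # K candidate state embeddings for one object
        # (e.g., they could be initialized from caption embeddings).
        self.state_embeds = nn.Parameter(torch.randn(num_states, embed_dim))
        # Tiny MLP mapping a scalar timestamp to mixing weights over the K states.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, 1) normalized timestamps in [0, 1].
        weights = torch.softmax(self.time_mlp(t), dim=-1)  # (B, K) state weights
        return weights @ self.state_embeds                 # (B, embed_dim) feature at time t


net = StatusDeformableNet()
feats = net(torch.tensor([[0.25], [0.75]]))  # features at two timestamps
print(feats.shape)  # torch.Size([2, 384])
```

Because the softmax weights vary continuously with t, the predicted feature interpolates smoothly between states rather than jumping discretely, matching the abstract's observation that objects transition smoothly across states.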