Learning 4D language fields to enable time-sensitive, open-ended language queries in dynamic scenes is essential for many real-world applications. While LangSplat successfully grounds CLIP features into 3D Gaussian representations, achieving precise and efficient open-vocabulary querying in static 3D scenes, it cannot handle dynamic 4D fields because CLIP, designed for static image-text tasks, fails to capture temporal dynamics in videos. Real-world environments are inherently dynamic, with object semantics evolving over time. Building a precise 4D language field requires pixel-aligned, object-wise video features, which current vision models struggle to produce. To address these challenges, we propose 4D LangSplat, which learns 4D language fields to efficiently handle both time-agnostic and time-sensitive open-vocabulary queries in dynamic scenes. Rather than learning the language field from vision features, 4D LangSplat learns directly from text: object-wise video captions generated by Multimodal Large Language Models (MLLMs). Specifically, we propose a multimodal object-wise video prompting method, combining visual and text prompts that guide MLLMs to produce detailed, temporally consistent, high-quality captions for each object throughout a video. These captions are encoded by a Large Language Model into sentence embeddings, which serve as pixel-aligned, object-specific feature supervision and enable open-vocabulary text queries through a shared embedding space. Recognizing that objects in 4D scenes transition smoothly between states, we further propose a status deformable network to model these continuous changes over time. Across multiple benchmarks, 4D LangSplat delivers precise and efficient results for both time-sensitive and time-agnostic open-vocabulary queries.
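To make the caption-to-embedding pipeline concrete, here is a minimal sketch of how object-wise captions could be encoded into a shared embedding space and matched against an open-vocabulary text query. The encoder choice (`all-MiniLM-L6-v2` via the sentence-transformers library) and the example captions are illustrative assumptions, not the paper's actual model or data.

```python
# Minimal sketch (not the paper's implementation): encode object-wise video
# captions into sentence embeddings and rank them against an open-vocabulary
# text query via cosine similarity in the shared embedding space.
# The encoder (all-MiniLM-L6-v2) is an assumed stand-in for the paper's LLM encoder.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical MLLM-generated captions for one object at three timestamps.
captions = [
    "a dog lying still on the grass",
    "a dog standing up and stretching",
    "a dog running across the lawn",
]
caption_embs = encoder.encode(captions, convert_to_tensor=True)

# A time-sensitive open-vocabulary query, encoded into the same space.
query_emb = encoder.encode("the dog starts to run", convert_to_tensor=True)

# Cosine similarity selects the timestamp whose caption best matches the query.
scores = util.cos_sim(query_emb, caption_embs)[0]
best = scores.argmax().item()
print(f"best match: {captions[best]} (score={scores[best]:.3f})")
```

Because both captions and queries live in the same sentence-embedding space, the same similarity test supports time-agnostic queries (matching any timestamp) and time-sensitive ones (matching a specific state).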
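The status deformable network is described only at a high level here; the sketch below shows one plausible reading, assuming an object's language feature at time t is a softmax-weighted blend of a small set of learned state embeddings, which yields smooth transitions between discrete states. The class name, dimensions, and the tiny time-MLP are all hypothetical, introduced purely for illustration.

```python
# Illustrative sketch (under stated assumptions, not the paper's architecture):
# blend K learned per-object state embeddings with time-dependent softmax
# weights so the object's feature evolves smoothly over time.
import torch
import torch.nn as nn


class StatusDeformableNet(nn.Module):
    def __init__(self, num_states: int = 4, embed_dim: int = 384, hidden: int = 64):
        super().__init__()
        # K candidate state embeddings for one object
        # (e.g., they could be initialized from caption embeddings).
        self.state_embeds = nn.Parameter(torch.randn(num_states, embed_dim))
        # Tiny MLP mapping a scalar timestamp to mixing weights over the K states.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_states),
        )

    def forward(self, t: torch.Tensor) -> torch.Tensor:
        # t: (B, 1) normalized timestamps in [0, 1].
        weights = torch.softmax(self.time_mlp(t), dim=-1)  # (B, K) state weights
        return weights @ self.state_embeds                 # (B, embed_dim) feature at time t


net = StatusDeformableNet()
feats = net(torch.tensor([[0.25], [0.75]]))  # features at two timestamps
print(feats.shape)  # torch.Size([2, 384])
```

Because the softmax weights vary continuously with t, the predicted feature interpolates smoothly between states rather than jumping discretely, matching the abstract's observation that objects transition smoothly across states.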