Recent advancements in 2D and multimodal models have achieved remarkable success by leveraging large-scale training on extensive datasets. However, extending these achievements to enable free-form interactions and high-level semantic operations with complex 3D/4D scenes remains challenging. This difficulty stems from the limited availability of large-scale, annotated 3D/4D or multi-view datasets, which are crucial for generalizable vision and language tasks such as open-vocabulary and prompt-based segmentation, language-guided editing, and visual question answering (VQA). In this paper, we introduce Feature4X, a universal framework designed to extend any functionality of a 2D vision foundation model into the 4D realm, using only monocular video input, which is widely available from user-generated content. The "X" in Feature4X represents its versatility, enabling any task through adaptable, model-conditioned 4D feature field distillation. At the core of our framework is a dynamic optimization strategy that unifies multiple model capabilities into a single representation. Additionally, to the best of our knowledge, Feature4X is the first method to distill and lift the features of video foundation models (e.g., SAM2, InternVideo2) into an explicit 4D feature field using Gaussian Splatting. Our experiments showcase novel-view segment anything, geometric and appearance scene editing, and free-form VQA across all time steps, empowered by LLMs in feedback loops. These advancements broaden the scope of agentic AI applications by providing a foundation for scalable, contextually and spatiotemporally aware systems capable of immersive dynamic 4D scene interaction.