While Multimodal Large Language Models (MLLMs) demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.
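The abstract names a token-level adaptive gated fusion mechanism but does not specify its form. The following is a minimal sketch of one common way such a gate could work, not the VEGA-3D implementation: each token gets a scalar gate computed from its concatenated semantic and geometric features, and the gate blends the two. The function names, the scalar-gate design, and the parameters `gate_weights` and `gate_bias` are all assumptions made for illustration; in practice they would be learned.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gated_fusion(semantic, geometric, gate_weights, gate_bias):
    """Blend per-token semantic and geometric features with a scalar gate.

    semantic, geometric: lists of tokens, each token a list of D floats.
    gate_weights: list of 2 * D floats (hypothetical learned parameters).
    gate_bias: float (hypothetical learned parameter).
    Returns (fused_tokens, gates).
    """
    fused, gates = [], []
    for s_tok, g_tok in zip(semantic, geometric):
        # The gate is conditioned on both feature streams for this token.
        concat = s_tok + g_tok
        logit = sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias
        gate = sigmoid(logit)
        # gate -> 1 favors geometric cues; gate -> 0 keeps the semantic token.
        fused.append([gate * g + (1.0 - gate) * s for s, g in zip(s_tok, g_tok)])
        gates.append(gate)
    return fused, gates


if __name__ == "__main__":
    sem = [[1.0, 2.0], [0.0, 0.0]]
    geo = [[3.0, 4.0], [1.0, 1.0]]
    fused_tokens, token_gates = gated_fusion(sem, geo, [0.0] * 4, 0.0)
    print(fused_tokens, token_gates)
```

With all-zero gate parameters every gate equals 0.5, so the output is the plain average of the two streams; a trained gate would instead vary per token, which is what makes the fusion adaptive.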