High-quality 3D world models are pivotal for embodied intelligence and Artificial General Intelligence (AGI), underpinning applications such as AR/VR content creation and robotic navigation. Despite their strong imaginative priors, current video foundation models lack explicit 3D grounding, which limits both their spatial consistency and their utility for downstream 3D reasoning tasks. In this work, we present FantasyWorld, a geometry-enhanced framework that augments a frozen video foundation model with a trainable geometric branch, enabling joint modeling of video latents and an implicit 3D field in a single forward pass. Our approach introduces cross-branch supervision, in which geometry cues guide video generation and video priors regularize 3D prediction, yielding consistent and generalizable 3D-aware video representations. Notably, the latents produced by the geometric branch can potentially serve as versatile representations for downstream 3D tasks such as novel view synthesis and navigation, without requiring per-scene optimization or fine-tuning. Extensive experiments show that FantasyWorld effectively bridges video imagination and 3D perception, outperforming recent geometry-consistent baselines in multi-view coherence and style consistency. Ablation studies further confirm that these gains stem from the unified backbone and cross-branch information exchange.
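The abstract describes a two-branch design: a frozen video backbone produces latents, and a trainable geometric branch consumes them to predict an implicit 3D field in the same forward pass. The toy sketch below illustrates only that dataflow; all function names, dimensions, and the linear/tanh layers are hypothetical stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen video foundation model: its weights are fixed
# (hypothetical toy projection from 16-dim frame features to 8-dim latents).
W_video = rng.standard_normal((16, 8))

def frozen_video_branch(frames):
    # Frozen branch: no gradient would flow into W_video during training.
    return frames @ W_video

def geometry_branch(video_latents, W_geo):
    # Trainable geometric branch: conditioned on the video latents
    # (a minimal form of cross-branch information exchange).
    return np.tanh(video_latents @ W_geo)

# Toy clip: 4 frames, each with 16-dim features.
frames = rng.standard_normal((4, 16))
W_geo = 0.1 * rng.standard_normal((8, 8))  # only these weights are trained

video_latents = frozen_video_branch(frames)          # video latents ...
geo_latents = geometry_branch(video_latents, W_geo)  # ... and 3D-field latents
# Both sets of latents come from one forward pass over the same frames.
print(video_latents.shape, geo_latents.shape)
```

The sketch is only meant to show that the geometric latents are computed jointly with, and conditioned on, the frozen video latents; the actual cross-branch supervision losses are not modeled here.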