World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Source: arXiv. The code is available at https://github.com/WanyueZhang-ai/World2VLM and the dataset is available at https://huggingface.co/datasets/WanyueZhang/World2VLM

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and inverse (outcome-to-action) spatial reasoning. We post-train the VLM with a two-stage recipe on a compact dataset generated by this pipeline and evaluate it on multiple spatial reasoning benchmarks. World2VLM delivers consistent improvements over the base model across diverse benchmarks, including SAT-Real, SAT-Synthesized, VSI-Bench, and MindCube. It also outperforms the test-time world-model-coupled methods while eliminating the need for expensive inference-time generation. Our results suggest that world models can serve not only as inference-time tools, but also as effective training-time teachers, enabling VLMs to internalize spatial imagination in a scalable and efficient manner.
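
To make the forward (action-to-outcome) and inverse (outcome-to-action) supervision described above more concrete, here is a minimal Python sketch of how such paired samples might be assembled. Everything in it is an assumption for illustration: the `CameraAction` parameterization, the `stub_world_model` placeholder (which only returns string identifiers instead of rendering images), and the question templates are hypothetical and are not taken from the World2VLM code or dataset.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class CameraAction:
    """Hypothetical camera-motion parameterization (not the paper's format)."""
    kind: str         # e.g. "move_forward", "turn_left", "turn_right"
    magnitude: float  # meters for translation, degrees for rotation

    def describe(self) -> str:
        unit = "degrees" if self.kind.startswith("turn") else "meters"
        return f"{self.kind.replace('_', ' ')} by {self.magnitude:g} {unit}"


# Assumed interface: (observation id, action) -> id of the synthesized future view.
WorldModel = Callable[[str, CameraAction], str]


def stub_world_model(image: str, action: CameraAction) -> str:
    """Placeholder for a generative world model; returns a label, not pixels."""
    return f"{image}|{action.kind}|{action.magnitude:g}"


def forward_sample(image: str, action: CameraAction,
                   candidates: List[CameraAction],
                   world_model: WorldModel) -> Dict:
    """Action-to-outcome: given a view and an action, pick the resulting view."""
    options = [world_model(image, a) for a in candidates]
    return {
        "images": [image] + options,
        "question": (f"Starting from the first image, the camera will "
                     f"{action.describe()}. Which candidate view results?"),
        "answer": candidates.index(action),
    }


def inverse_sample(image: str, action: CameraAction,
                   candidates: List[CameraAction],
                   world_model: WorldModel) -> Dict:
    """Outcome-to-action: given before/after views, pick the camera motion."""
    future = world_model(image, action)
    return {
        "images": [image, future],
        "question": "Which camera motion turns the first view into the second?",
        "options": [a.describe() for a in candidates],
        "answer": candidates.index(action),
    }


actions = [
    CameraAction("move_forward", 0.5),
    CameraAction("turn_left", 30),
    CameraAction("turn_right", 30),
]
fwd = forward_sample("obs0", actions[1], actions, stub_world_model)
inv = inverse_sample("obs0", actions[1], actions, stub_world_model)
print(fwd["question"])
print(inv["options"], inv["answer"])
```

The point of the sketch is that a single world-model rollout can yield two complementary training signals: the forward sample asks the model to predict the outcome of a known motion, and the inverse sample asks it to recover the motion from the observed outcome.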
