ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
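The dual-temporal pathway and multi-layer aggregation described above can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration only: the function names, the clip length, the window size, and the softmax-weighted sum over layers are hypothetical stand-ins, since the abstract does not specify how the pyramid module combines layers or how frames are indexed.

```python
import numpy as np

def dense_indices(start, window):
    """Consecutive frame indices for the dense (JEPA-style) branch."""
    return np.arange(start, start + window)

def uniform_indices(num_frames, num_samples):
    """Uniformly spaced frame indices over the whole clip (VLM 'thinker' branch)."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)

def aggregate_layers(layer_feats, weights):
    """Softmax-weighted sum over layers: (L, T, D) -> (T, D).

    Hypothetical stand-in for the pyramid extraction module; the real
    module's design is not given in the abstract.
    """
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    return np.tensordot(w, layer_feats, axes=1)

num_frames = 64                                       # assumed clip length
dense = dense_indices(start=0, window=8)              # short, stride-1 window
sparse = uniform_indices(num_frames, num_samples=8)   # larger temporal stride

# Random placeholders for features taken from 4 VLM layers, 16-dim each.
rng = np.random.default_rng(0)
layer_feats = rng.normal(size=(4, len(sparse), 16))

# Zero weights give a uniform average over layers.
guidance = aggregate_layers(layer_feats, weights=np.zeros(4))
```

In this sketch both branches see eight frames, but the dense branch covers only the first eight frames of the clip while the sparse branch spans all 64, which is the contrast in temporal context the abstract draws between the two pathways.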
