Recent progress in latent world models (e.g., V-JEPA2) has shown promise in forecasting future world states from video observations. However, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, which makes long-horizon semantics difficult to capture and reduces downstream utility. Vision--language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames. They are nonetheless ill-suited as standalone dense predictors for three reasons: compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM \emph{thinker} branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.
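To make the dual-temporal pathway concrete, the following is a minimal sketch of how the two branches might select frames from the same clip. It is an illustration under stated assumptions, not the paper's implementation: the function names `dense_window` and `uniform_sample`, the choice to take the dense window from the end of the clip, and the clip length and sample counts are all hypothetical.

```python
def dense_window(num_frames: int, window: int) -> list[int]:
    """Consecutive frame indices for the dense branch.

    Assumption: the short observation window is the last `window`
    frames of the clip.
    """
    start = max(0, num_frames - window)
    return list(range(start, num_frames))


def uniform_sample(num_frames: int, num_samples: int) -> list[int]:
    """Evenly spaced frame indices over the whole clip for the VLM branch."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    if num_samples >= num_frames:
        return list(range(num_frames))
    # Fractional stride so the first and last frames are always included.
    stride = (num_frames - 1) / (num_samples - 1)
    return [round(i * stride) for i in range(num_samples)]


if __name__ == "__main__":
    # Hypothetical sizes: a 64-frame clip, 8 frames per branch.
    dense = dense_window(64, 8)
    sparse = uniform_sample(64, 8)
    print("dense branch :", dense)
    print("VLM branch   :", sparse)
```

With the same frame budget, the dense indices stay at stride 1 and cover only a short span, while the uniformly sampled indices span the full clip at a larger stride, which is the trade-off between fine-grained motion cues and long-horizon context described above.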