FutureVLA: Joint Visuomotor Prediction for Vision-Language-Action Model

Predictive foresight is important to intelligent embodied agents. Because a robot's motor execution is intrinsically constrained by its visual perception of environmental geometry, effectively anticipating the future requires capturing this tightly coupled visuomotor interplay. Although recent vision-language-action (VLA) models attempt to incorporate future guidance, they struggle with this joint modeling. Existing explicit methods divert capacity to task-irrelevant visual details, whereas implicit methods that rely on sparse frame pairs disrupt temporal continuity. Because they rely heavily on visual reconstruction, these methods become visually dominated, entangling static scene context with dynamic action intent. We argue that effective joint visuomotor predictive modeling requires both temporal continuity and visually conditioned supervision decoupling. To this end, we propose FutureVLA, which features a novel Joint Visuomotor Predictive Architecture. FutureVLA extracts joint visuomotor embeddings by first decoupling visual and motor information and then jointly encoding generalized physical priors. Specifically, in the pretraining stage, we leverage heterogeneous manipulation datasets and introduce a Joint Visuomotor Gating mechanism that structurally separates visual state preservation from temporal action modeling. This mechanism allows the motor stream to focus on continuous physical dynamics while explicitly querying visual tokens for environmental constraints, yielding highly generalizable joint visuomotor embeddings. Subsequently, in the post-training stage, we employ a latent embedding alignment strategy that enables diverse downstream VLA models to internalize these temporal priors without modifying their inference architectures. Extensive experiments demonstrate that FutureVLA consistently improves VLA frameworks.
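
As a rough illustration of the two ideas named above, the sketch below shows (a) motor tokens querying visual tokens through attention, with a sigmoid gate controlling how much visual context is mixed back into the motor stream, and (b) an alignment loss that pulls a downstream model's latent embeddings toward pretrained ones. The abstract does not specify either formulation, so everything here is an assumption made for illustration: the function names `gated_visual_query` and `cosine_alignment_loss`, the scalar per-token gate, the absence of learned query/key/value projections, and the choice of cosine distance are all hypothetical. It is written in dependency-free Python for readability, not as the authors' implementation.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_visual_query(motor_tokens, visual_tokens, gate_weights, gate_bias=0.0):
    """Hypothetical gated cross-attention (not the paper's exact mechanism).

    Each motor token attends over all visual tokens (scaled dot-product,
    no learned projections), then adds the attended visual context to
    itself, scaled by a sigmoid gate computed from the motor token.
    Motor and visual tokens must share the same dimensionality.
    """
    scale = math.sqrt(len(visual_tokens[0]))
    fused, attention = [], []
    for m in motor_tokens:
        # Attention weights of this motor token over the visual tokens.
        weights = softmax([dot(m, v) / scale for v in visual_tokens])
        # Weighted sum of visual tokens: the queried visual context.
        context = [
            sum(w * v[i] for w, v in zip(weights, visual_tokens))
            for i in range(len(m))
        ]
        # Scalar gate in (0, 1): how much visual context enters the motor stream.
        gate = 1.0 / (1.0 + math.exp(-(dot(gate_weights, m) + gate_bias)))
        fused.append([mi + gate * ci for mi, ci in zip(m, context)])
        attention.append(weights)
    return fused, attention


def cosine_alignment_loss(student, teacher):
    """Mean (1 - cosine similarity) over paired embeddings.

    An assumed stand-in for a latent alignment objective: 0 when every
    pair points in the same direction, 2 when every pair is opposite.
    """
    losses = []
    for s, t in zip(student, teacher):
        cos = dot(s, t) / (math.sqrt(dot(s, s)) * math.sqrt(dot(t, t)))
        losses.append(1.0 - cos)
    return sum(losses) / len(losses)
```

In this sketch the alignment loss touches only training-time embeddings, which is one way a downstream model could absorb pretrained priors while leaving its inference path unchanged, in line with the claim made in the abstract.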
