Enabling vision-language-action (VLA) models to predict environmental dynamics, known as world modeling, is widely recognized as essential for improving robotic reasoning and generalization. However, current approaches face two main issues: (1) the training objective forces models to over-emphasize pixel-level reconstruction, which constrains semantic learning and generalization; and (2) reliance on predicted future observations during inference often leads to error accumulation. To address these challenges, we introduce Future Representation Alignment via Parallel Progressive Expansion (FRAPPE). Our method adopts a two-stage fine-tuning strategy: in the mid-training phase, the model learns to predict the latent representations of future observations; in the post-training phase, we expand the computational workload in parallel and align these representations simultaneously with multiple distinct visual foundation models. By significantly improving fine-tuning efficiency and reducing dependence on action-annotated data, FRAPPE provides a scalable and data-efficient pathway to enhancing world awareness in generalist robotic policies. Experiments on the RoboTwin benchmark and real-world tasks demonstrate that FRAPPE outperforms state-of-the-art approaches and generalizes strongly to long-horizon and unseen scenarios.
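The alignment objective in the post-training phase can be illustrated with a minimal sketch. The abstract does not specify the exact loss, so this assumes a simple average of cosine-distance terms between the model's predicted future latent and the embeddings produced by several visual foundation model teachers; the function name and loss form are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def multi_teacher_align_loss(pred_latent, teacher_embeddings):
    """Hypothetical multi-teacher alignment loss (sketch, not the paper's code):
    average of (1 - cosine similarity) between the policy's predicted future
    latent and each visual-foundation-model teacher embedding."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    # Averaging over teachers lets the predicted latent absorb complementary
    # semantics from multiple foundation models at once.
    return sum(1.0 - cosine(pred_latent, t) for t in teacher_embeddings) / len(teacher_embeddings)
```

A perfectly aligned prediction yields a loss of 0, while a latent orthogonal to every teacher yields 1; minimizing this term pulls the predicted representation toward the shared semantic space of the teachers rather than toward pixel-level reconstruction.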