Robust robotic manipulation requires both predicting how a scene evolves over time and recognizing task-relevant objects in complex scenes. However, existing vision-language-action (VLA) models face two limitations: they typically act only on the current frame, and future prediction and object-aware reasoning are often learned in separate latent spaces. We propose OFlow (injecting Object-Aware Temporal Flow Matching into VLAs), a framework that addresses both limitations by unifying temporal foresight and object-aware reasoning in a shared semantic latent space. OFlow forecasts future latents with temporal flow matching, factorizes them into object-aware representations that emphasize physically relevant cues while filtering out task-irrelevant variation, and conditions continuous action generation on these predictions. Integrated into VLA pipelines, OFlow enables more reliable control under distribution shift. Extensive experiments on the LIBERO, LIBERO-Plus, MetaWorld, and SimplerEnv benchmarks and on real-world tasks demonstrate that object-aware foresight consistently improves robustness and success rates.
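To make the forecasting step concrete, the following is a minimal sketch of generic flow matching applied to latent forecasting. It is not the OFlow implementation. It assumes a straight-line probability path that starts at the current latent and ends at the future latent, which is one common choice and may differ from the path used in the paper. The names `flow_matching_pair`, `flow_matching_loss`, `euler_forecast`, and `velocity_fn` are hypothetical, and the object-aware factorization and action head are omitted.

```python
import numpy as np

def flow_matching_pair(z_now, z_future, t):
    """Build one training pair on an assumed straight-line path.

    z_now, z_future: arrays of shape (batch, dim), current and future latents.
    t: array of shape (batch, 1) with values in [0, 1].
    """
    z_t = (1.0 - t) * z_now + t * z_future   # point on the path at time t
    target_velocity = z_future - z_now       # constant velocity of that path
    return z_t, target_velocity

def flow_matching_loss(velocity_fn, z_now, z_future, rng):
    """Mean squared error between predicted and target velocities."""
    t = rng.uniform(size=(z_now.shape[0], 1))
    z_t, target = flow_matching_pair(z_now, z_future, t)
    pred = velocity_fn(z_t, t)               # hypothetical learned network
    return float(np.mean((pred - target) ** 2))

def euler_forecast(velocity_fn, z_now, num_steps=10):
    """Forecast the future latent by Euler-integrating the velocity field."""
    z = z_now.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = np.full((z.shape[0], 1), k * dt)
        z = z + dt * velocity_fn(z, t)
    return z
```

In this sketch, a downstream policy would receive the output of `euler_forecast` as extra conditioning alongside the current observation. How OFlow factorizes the forecast into object-aware representations before conditioning the action generator is not specified in the abstract and is not modeled here.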