Vision-Language-Action (VLA) models have recently enabled robotic manipulation by grounding visual and linguistic cues into actions. However, most VLAs assume the Markov property: they rely only on the current observation and thus suffer from temporal myopia that degrades long-horizon coherence. In this work, we view motion as a more compact and informative representation of temporal context and world dynamics, one that captures inter-state changes while filtering out static pixel-level noise. Building on this idea, we propose HiF-VLA (Hindsight, Insight, and Foresight for VLAs), a unified framework that leverages motion for bidirectional temporal reasoning. HiF-VLA equips the VLA with a motion-centric world model, enabling agents to reason about temporal dynamics and anticipate future evolution during action generation. It encodes past dynamics through hindsight priors, anticipates future motion via foresight reasoning, and integrates both through a hindsight-modulated joint expert to enable a "think-while-acting" paradigm for long-horizon manipulation. As a result, HiF-VLA surpasses strong baselines on the LIBERO-Long and CALVIN ABC-D benchmarks while incurring negligible additional inference latency. Furthermore, HiF-VLA achieves substantial improvements in real-world long-horizon manipulation tasks, demonstrating its broad effectiveness in practical robotic settings.