Vision-language-action (VLA) models have advanced robot manipulation through large-scale pretraining, but real-world deployment remains challenging because of partial observability and delayed feedback. Reinforcement learning addresses these challenges with value functions, which assess task progress and guide policy improvement. However, existing value models built on vision-language models (VLMs) struggle to capture temporal dynamics, which undermines reliable value estimation in long-horizon tasks. In this paper, we propose ViVa, a video-generative value model that repurposes a pretrained video generator for value estimation. Taking the current observation and robot proprioception as input, ViVa jointly predicts future proprioception and a scalar value for the current state. By leveraging the spatiotemporal priors of the pretrained video generator, our approach grounds value estimation in anticipated embodiment dynamics, so that value is coupled with foresight rather than derived from a static snapshot. Integrated into RECAP, ViVa delivers substantial improvements on real-world box assembly. Qualitative analysis across all three tasks confirms that ViVa produces more reliable value signals that accurately reflect task progress. Because these priors come from video corpora, ViVa also generalizes to novel objects, highlighting the promise of video-generative models for value estimation.
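For orientation, the two quantities named above can be written in standard notation. This is a sketch under stated assumptions, not the paper's own formulation: the first expression is the textbook state-value function from reinforcement learning, which may differ from the training target ViVa actually uses, and the second only restates the input-output description of the model. The symbols $\pi$ (policy), $\gamma$ (discount factor), $r_t$ (reward), $o_t$ (current observation), $q_t$ (current proprioception), $H$ (prediction horizon), and $f_\theta$ (the value model) are all introduced here for illustration.

\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, \sum_{t=0}^{\infty} \gamma^{t} r_{t} \;\middle|\; s_{0} = s \right], \qquad 0 \le \gamma < 1
\]

\[
f_{\theta}(o_t, q_t) = \left( \hat{q}_{t+1:t+H},\; \hat{V}_t \right)
\]

Here $\hat{q}_{t+1:t+H}$ denotes the predicted future proprioception and $\hat{V}_t$ the scalar value estimate for the current state.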