Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such a progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods that can scale and generalize. To address these challenges, we present Generative Value Learning (GVL), a universal value function estimator that leverages the world knowledge embedded in vision-language models (VLMs) to predict task progress. Naively asking a VLM to predict values for a video sequence performs poorly due to the strong temporal correlation between successive frames. Instead, GVL poses value estimation as a temporal ordering problem over shuffled video frames; this seemingly more challenging task encourages VLMs to more fully exploit their underlying semantic and temporal grounding capabilities to differentiate frames based on their perceived task progress, consequently producing significantly better value predictions. Without any robot- or task-specific training, GVL can zero-shot and few-shot predict effective values in context for more than 300 distinct real-world tasks across diverse robot platforms, including challenging bimanual manipulation tasks. Furthermore, we demonstrate that GVL permits flexible multi-modal in-context learning via examples from heterogeneous tasks and embodiments, such as human videos. The generality of GVL enables various downstream applications pertinent to visuomotor policy learning, including dataset filtering, success detection, and advantage-weighted regression -- all without any model training or finetuning.
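The shuffle-then-order idea above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: `vlm_predict` is a hypothetical stand-in for whatever VLM API is used, assumed to return one progress value per frame in the order the frames were given.

```python
import random


def gvl_values(frames, task_description, vlm_predict, seed=0):
    """Estimate per-frame task progress by querying a VLM on shuffled frames.

    frames: list of images (whatever objects the VLM call accepts).
    vlm_predict: hypothetical callable(task_description, shuffled_frames)
        returning a list of progress values, one per shuffled frame.
    Shuffling breaks the strong temporal correlation between successive
    frames, so the model must judge each frame on its visual content.
    """
    rng = random.Random(seed)  # fixed seed for a reproducible permutation
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]

    shuffled_values = vlm_predict(task_description, shuffled)

    # Un-shuffle: place each predicted value back at its original index,
    # recovering a value trajectory aligned with the input video.
    values = [0.0] * len(frames)
    for pos, orig_idx in enumerate(order):
        values[orig_idx] = shuffled_values[pos]
    return values
```

In practice the VLM sees the shuffled frames plus the task description in one prompt and emits a value for each; the un-shuffling step is what turns its per-frame ordering judgments back into a temporal value function.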