Using vision-language models (VLMs) as reward models in reinforcement learning holds promise for reducing costs and improving safety. So far, VLM reward models have only been used for goal-oriented tasks, where the agent must reach a particular final outcome. We explore VLMs' potential to supervise tasks that cannot be scored by the final state alone. To this end, we introduce ViSTa, a dataset for evaluating Vision-based understanding of Sequential Tasks. ViSTa comprises over 4,000 videos with step-by-step descriptions in virtual home, Minecraft, and real-world environments. Its novel hierarchical structure -- basic single-step tasks composed into increasingly complex sequential tasks -- enables a fine-grained assessment of how well VLMs can judge tasks of varying complexity. To illustrate this, we use ViSTa to evaluate state-of-the-art VLMs, including CLIP, ViCLIP, and GPT-4o. We find that, while they are all good at object recognition, they fail to understand sequential tasks, with only GPT-4o achieving non-trivial performance.