Pre-trained Vision-Language Models (VLMs) can understand visual concepts, describe complex tasks and decompose them into sub-tasks, and provide feedback on task completion. In this paper, we aim to leverage these capabilities to support the training of reinforcement learning (RL) agents. In principle, VLMs are well suited for this purpose, as they can naturally analyze image-based observations and provide feedback (reward) on learning progress. However, inference in VLMs is computationally expensive, so querying them frequently to compute rewards would significantly slow down the training of an RL agent. To address this challenge, we propose a framework named Code as Reward (VLM-CaR). VLM-CaR produces dense reward functions from VLMs through code generation, thereby significantly reducing the computational burden of querying the VLM directly. We show that the dense rewards generated through our approach are highly accurate across a diverse set of discrete and continuous environments, and can be more effective in training RL policies than the original sparse environment rewards.
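To make the cost argument concrete, the following is a minimal sketch of the general idea, not the VLM-CaR implementation. It assumes a toy grid observation and two hand-written sub-task checks (`key_picked_up`, `door_opened`) that stand in for programs a VLM might generate once, before training. All names, the grid encoding, and the reward shaping (fraction of sub-tasks completed) are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Sequence

# Toy observation: a grid of integer cell codes (an assumption for this sketch).
# 0 = empty, 1 = agent, 2 = key, 3 = closed door
Observation = Sequence[Sequence[int]]
SubTaskCheck = Callable[[Observation], bool]


def key_picked_up(obs: Observation) -> bool:
    """Hypothetical generated check: the key is no longer on the grid."""
    return not any(2 in row for row in obs)


def door_opened(obs: Observation) -> bool:
    """Hypothetical generated check: no closed door remains on the grid."""
    return not any(3 in row for row in obs)


def make_dense_reward(checks: List[SubTaskCheck]) -> Callable[[Observation], float]:
    """Combine sub-task checks into a dense reward in [0, 1].

    The reward is the fraction of sub-tasks whose check passes. Evaluating
    it is plain Python, so no VLM call is needed at training time.
    """
    def reward(obs: Observation) -> float:
        passed = sum(1 for check in checks if check(obs))
        return passed / len(checks)

    return reward


# Built once (standing in for a one-off code-generation step),
# then evaluated at every environment step.
dense_reward = make_dense_reward([key_picked_up, door_opened])

start = [[1, 0], [2, 3]]       # key and door both present
after_key = [[0, 0], [1, 3]]   # key collected, door still closed
after_door = [[0, 0], [0, 1]]  # key collected and door opened

rewards = [dense_reward(o) for o in (start, after_key, after_door)]
```

In this sketch the expensive model would be consulted only to write the checks, while the per-step reward is an ordinary function call, which is the source of the savings described above.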