Using learned reward functions (LRFs) to solve sparse-reward reinforcement learning (RL) tasks has yielded steady progress in task complexity over the years. In this work, we question whether today's LRFs are best suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP), which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL rather than as a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped, exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP uses reinforcement learning to optimize these rewards jointly with standard novelty-seeking exploration rewards, yielding a language-conditioned, pretrained policy. Our VLM pretraining approach, which departs from previous attempts to use LRFs, can warm-start sample-efficient learning on robot manipulation tasks in RLBench.
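The reward construction described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the embeddings stand in for the outputs of a frozen VLM's image and text encoders, cosine similarity stands in for the contrastive alignment score, a k-nearest-neighbor distance stands in for the "standard novelty-seeking" bonus, and the mixing weight `alpha` and the function names are hypothetical.

```python
import numpy as np


def cosine_alignment(image_emb, text_emb):
    """Cosine similarity between each image embedding (one per row) and one text embedding.

    Assumption: this plays the role of the VLM's contrastive alignment score.
    """
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return image_emb @ text_emb


def knn_novelty(embs, k=3):
    """Distance from each embedding to its k-th nearest neighbor in the batch.

    Assumption: a simple stand-in for a novelty-seeking exploration bonus.
    """
    # Pairwise Euclidean distances, shape (batch, batch).
    dists = np.linalg.norm(embs[:, None, :] - embs[None, :, :], axis=-1)
    # After sorting each row, column 0 is the zero self-distance,
    # so column k is the distance to the k-th nearest other point.
    return np.sort(dists, axis=1)[:, k]


def pretraining_reward(image_emb, text_emb, alpha=0.5, k=3):
    """Weighted sum of the language-alignment reward and the novelty bonus.

    `alpha` is a hypothetical mixing weight, not a value from the source.
    """
    alignment = cosine_alignment(image_emb, text_emb)
    novelty = knn_novelty(image_emb, k=k)
    return alpha * alignment + (1.0 - alpha) * novelty
```

In a pretraining loop, `text_emb` would be resampled from the pool of diverse language instructions, and the resulting per-step reward would be handed to whichever RL algorithm trains the language-conditioned policy.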