For a general-purpose robot to operate in the real world, it must be able to execute a broad range of instructions across diverse environments. Central to reinforcement learning and planning for such robotic agents is a generalizable reward function. Recent vision-language models, such as CLIP, have shown remarkable performance, paving the way for open-domain visual recognition. However, collecting data of robots executing diverse language instructions across many environments remains challenging. This paper aims to transfer a video-language model with strong generalization into a generalizable language-conditioned reward function, using only robot video data from a minimal number of tasks in a single environment. Unlike the robotic datasets commonly used to train reward functions, human video-language datasets rarely contain trivial failure videos. To enhance the model's ability to distinguish successful from failed robot executions, we cluster failure-video features so the model can identify patterns within them. For each cluster, we train a failure prompt and integrate it into the text encoder to represent the corresponding failure mode. Our language-conditioned reward function generalizes well to new environments and new instructions for robot planning and reinforcement learning.
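One way to picture the failure-prompt mechanism described above is the following minimal sketch on synthetic features. The k-means clustering, centroid initialization of the prompts, and the success-minus-failure scoring rule are illustrative assumptions for this sketch, not the paper's exact training procedure; in the real system the embeddings would come from a pretrained video-language model such as CLIP, and the failure prompts would be learned jointly with the text encoder.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Stand-ins for encoder outputs (hypothetical 64-dim embeddings); the real
# system would use video/text features from a pretrained video-language model.
failure_feats = rng.normal(size=(200, 64))  # embeddings of failure videos
failure_feats /= np.linalg.norm(failure_feats, axis=1, keepdims=True)

# Step 1: cluster failure-video features to discover distinct failure modes.
n_modes = 5
km = KMeans(n_clusters=n_modes, n_init=10, random_state=0).fit(failure_feats)

# Step 2: one "failure prompt" embedding per cluster. Here each prompt is
# simply initialized at its cluster centroid; the paper instead trains these
# prompts and integrates them into the text encoder.
failure_prompts = km.cluster_centers_
failure_prompts /= np.linalg.norm(failure_prompts, axis=1, keepdims=True)

def reward(video_feat, instruction_feat, failure_prompts):
    """Language-conditioned reward sketch: cosine similarity between the
    video and the instruction, minus similarity to the closest failure mode."""
    v = video_feat / np.linalg.norm(video_feat)
    t = instruction_feat / np.linalg.norm(instruction_feat)
    success_score = float(v @ t)                 # in [-1, 1]
    failure_score = float((failure_prompts @ v).max())  # in [-1, 1]
    return success_score - failure_score

video = rng.normal(size=64)
instruction = rng.normal(size=64)
r = reward(video, instruction, failure_prompts)
```

The success-minus-failure margin gives higher reward to rollouts that match the instruction while resembling none of the discovered failure modes, which is the intuition behind contrasting successful executions against clustered failures.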