Developing an agent that can adapt to unseen environments remains a difficult challenge in imitation learning. This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as that of CLIP) and use it as a reward signal. We then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This yields better generalization than existing text-conditioned policies, even on unseen text instructions. To improve the quality of the rewards, we also introduce a fine-tuning method for the pre-trained multimodal encoders, which further improves performance. Video demonstrations and source code are available on the project website: \url{https://sites.google.com/view/2023arp}.
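The reward labeling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `multimodal_rewards` and `returns_to_go` are hypothetical, the random arrays stand in for embeddings that a pre-trained multimodal encoder such as CLIP would produce, and the use of cosine similarity and an undiscounted return-to-go are assumptions made for illustration.

```python
import numpy as np


def multimodal_rewards(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
    """Per-timestep reward: cosine similarity between each frame embedding
    (shape (T, D)) and a single instruction embedding (shape (D,))."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return image_emb @ text_emb


def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Undiscounted sum of rewards from each timestep to the end of the
    episode (an assumed choice of return for conditioning the policy)."""
    return np.cumsum(rewards[::-1])[::-1]


# Stand-in embeddings (assumption): in practice these would come from the
# image and text towers of a pre-trained multimodal encoder.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 8))      # 5 timesteps, 8-dimensional embeddings
instruction = rng.normal(size=8)      # one text instruction embedding

rewards = multimodal_rewards(frames, instruction)  # shape (5,)
rtg = returns_to_go(rewards)                       # shape (5,)
```

Each expert demonstration would then carry a per-timestep return alongside its observations and actions, and a return-conditioned policy could be trained to predict the expert action given the observation and that return.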