Developing an agent that can adapt to unseen environments remains a difficult challenge in imitation learning. In this work, we present Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as that of CLIP) and use it as a reward signal. We then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the multimodal rewards provide an adaptive signal at each timestep, ARP effectively mitigates goal misgeneralization. This yields better generalization than existing text-conditioned policies, even on unseen text instructions. To improve the quality of the rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, which further improves performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp.
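The reward labeling described above can be illustrated with a minimal sketch. It assumes that frame and instruction embeddings have already been produced by some pre-trained multimodal encoder (the toy vectors below are hypothetical placeholders, not CLIP outputs), that the similarity is cosine similarity, and that the policy is conditioned on an undiscounted return-to-go. The function names are illustrative and not taken from the released code.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multimodal_rewards(frame_embeddings, text_embedding):
    """Per-timestep reward: similarity of each frame to the instruction."""
    return [cosine_similarity(f, text_embedding) for f in frame_embeddings]


def returns_to_go(rewards):
    """Undiscounted sum of rewards from each timestep to the episode end."""
    total = 0.0
    out = []
    for r in reversed(rewards):
        total += r
        out.append(total)
    out.reverse()
    return out


# Hypothetical 2-D embeddings standing in for encoder outputs.
text_embedding = [1.0, 0.0]
frame_embeddings = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

rewards = multimodal_rewards(frame_embeddings, text_embedding)
returns = returns_to_go(rewards)
```

In this toy trajectory the frames move progressively closer to the instruction embedding, so the per-step reward increases over time. The resulting returns would then serve as the conditioning input when training the policy on the labeled demonstrations.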