For embodied reinforcement learning (RL) agents interacting with the environment, rapid policy adaptation to unseen visual observations is desirable, but achieving zero-shot adaptation remains a challenging problem in the RL context. To address this problem, we present a novel contrastive prompt ensemble (ConPE) framework that utilizes a pretrained vision-language model and a set of visual prompts, enabling efficient policy learning and adaptation across a wide range of environmental and physical changes encountered by embodied agents. Specifically, we devise a guided-attention-based ensemble approach that combines multiple visual prompts on the vision-language model to construct robust state representations. Each prompt is contrastively learned with respect to an individual domain factor that significantly affects the agent's egocentric perception and observation. For a given task, the attention-based ensemble and policy are jointly learned so that the resulting state representations not only generalize to various domains but are also optimized for learning the task. Through experiments, we show that ConPE outperforms other state-of-the-art algorithms on several embodied agent tasks, including navigation in AI2THOR, manipulation in egocentric Metaworld, and autonomous driving in CARLA, while also improving the sample efficiency of policy learning and adaptation.
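The guided-attention ensemble described above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's exact formulation: it assumes each visual prompt yields one embedding of the observation from the frozen vision-language encoder, and it uses a simple dot-product guidance score against a learned task query vector; the names `prompt_ensemble` and `query` are hypothetical.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def prompt_ensemble(prompt_embeddings, query):
    """Aggregate prompt-conditioned embeddings into one state representation.

    prompt_embeddings: (K, D) array, one frozen-encoder embedding of the
                       current observation per visual prompt (illustrative).
    query:             (D,) learned guidance vector for the task (hypothetical
                       stand-in for the jointly learned attention parameters).
    """
    d = prompt_embeddings.shape[1]
    # Scaled dot-product scores between each prompt embedding and the query.
    scores = prompt_embeddings @ query / np.sqrt(d)
    weights = softmax(scores)
    # Attention-weighted sum forms the ensembled state representation.
    state = weights @ prompt_embeddings
    return state, weights
```

In this sketch, the policy would consume `state` as its input, and the attention weights would be trained jointly with the policy so that domain-relevant prompts dominate the representation for the task at hand.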