Despite recent breakthroughs in reinforcement learning (RL) and imitation learning (IL), existing algorithms fail to generalize beyond their training environments. Humans, by contrast, adapt quickly to new tasks by leveraging prior knowledge about the world, such as language descriptions. To facilitate research on language-guided agents with domain adaptation, we propose a novel zero-shot compositional policy learning task, in which environments are characterized as compositions of different attributes. Since no public environments support this study, we introduce a new research platform, BabyAI++, in which the dynamics of environments are disentangled from their visual appearance. In each episode, BabyAI++ provides varied vision-dynamics combinations along with corresponding descriptive texts. To evaluate the adaptation capability of learned agents, a set of vision-dynamics pairings is held out for testing on BabyAI++. Unsurprisingly, we find that current language-guided RL/IL techniques overfit to the training environments and suffer a large performance drop when facing unseen combinations. In response, we propose a multi-modal fusion method with an attention mechanism to perform visual language grounding. Extensive experiments provide strong evidence that language grounding improves the generalization of agents across environments with varied dynamics.
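
The held-out evaluation described above can be illustrated with a minimal sketch. It is not the BabyAI++ implementation: the attribute vocabularies, the function names `split_pairings` and `describe`, and the sentence template are all hypothetical, chosen only to show what a compositional split over vision-dynamics pairings might look like. The sketch assumes that every visual attribute and every dynamic should still appear somewhere in training, so that only the combination is new at test time.

```python
import itertools
import random

# Hypothetical attribute vocabularies (assumptions for illustration only;
# the actual BabyAI++ tile appearances and dynamics may differ).
COLORS = ["red", "green", "blue", "yellow"]
DYNAMICS = ["slippery", "sticky", "deadly", "safe"]


def split_pairings(colors, dynamics, n_held_out, seed=0, max_tries=1000):
    """Split all (color, dynamic) pairings into train and held-out sets.

    A split is accepted only if every color and every dynamic still occurs
    in at least one training pairing, so the test pairings are unseen
    combinations of seen attributes.
    """
    rng = random.Random(seed)
    all_pairs = list(itertools.product(colors, dynamics))
    for _ in range(max_tries):
        held_out = set(rng.sample(all_pairs, n_held_out))
        train = [p for p in all_pairs if p not in held_out]
        seen_colors = {c for c, _ in train}
        seen_dynamics = {d for _, d in train}
        if seen_colors == set(colors) and seen_dynamics == set(dynamics):
            return train, sorted(held_out)
    raise ValueError("no valid split found; try a smaller n_held_out")


def describe(pairing):
    """Render one pairing as a descriptive sentence (hypothetical template)."""
    color, dynamic = pairing
    return f"{color} tiles are {dynamic}"
```

Calling `split_pairings(COLORS, DYNAMICS, n_held_out=4)` under these assumptions yields 12 training pairings and 4 test pairings. An agent that merely memorizes which color goes with which dynamic gains nothing on the test pairings, whereas one that reads the output of `describe` could, in principle, act correctly on them.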