Natural language instruction following is paramount to enable collaboration between artificial agents and human beings. Natural language-conditioned reinforcement learning (RL) agents have shown how properties of natural language, such as compositionality, can provide a strong inductive bias for learning complex policies. Previous architectures such as HIGhER combine the benefits of language conditioning with Hindsight Experience Replay (HER) to deal with sparse-reward environments. Yet, like HER, HIGhER relies on an oracle predicate function to provide a feedback signal indicating which linguistic description is valid for which state. This reliance on an oracle limits its applicability. Additionally, HIGhER only leverages the linguistic information contained in successful RL trajectories, which hurts its final performance and data efficiency. Without early successful trajectories, HIGhER is no better than the DQN upon which it is built. In this paper, we propose the Emergent Textual Hindsight Experience Replay (ETHER) agent, which builds on HIGhER and addresses both of its limitations by means of (i) a discriminative visual referential game, commonly studied in the subfield of Emergent Communication (EC), used here as an unsupervised auxiliary task, and (ii) a semantic grounding scheme to align the emergent language with the natural language of the instruction-following benchmark. We show that the referential game's agents make an artificial language emerge that is aligned with the natural-like language used to describe goals in the BabyAI benchmark, and that this language is expressive enough to also describe unsuccessful RL trajectories. It can therefore provide feedback that lets the RL agent leverage the structured linguistic information contained in all trajectories. Our work shows that EC is a viable unsupervised auxiliary task for RL and provides missing pieces to make HER more widely applicable.
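To make the hindsight mechanism concrete, here is a minimal Python sketch of hindsight instruction relabeling in the spirit of HER/HIGhER. It assumes a "final state" relabeling strategy and a binary sparse reward. `Transition`, `hindsight_relabel`, and `toy_describer` are hypothetical names. The `describe` callable stands in for whatever supplies the linguistic description of a reached state: an oracle predicate in HIGhER, or a learned speaker aligned with the instruction language, as ETHER proposes. This is an illustration under those assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Transition:
    state: str
    action: int
    reward: float
    next_state: str
    instruction: str


# Hypothetical describer type: maps a state to an instruction it satisfies,
# or None if no description is available for that state.
Describer = Callable[[str], Optional[str]]


def hindsight_relabel(episode: List[Transition], describe: Describer) -> List[Transition]:
    """Relabel an episode with a description of the state it actually reached.

    Assumes the 'final' HER strategy and a binary sparse reward: only the last
    transition, which reaches the described state, receives reward 1.0.
    Returns an empty list when the describer cannot describe the final state.
    """
    new_instruction = describe(episode[-1].next_state)
    if new_instruction is None:
        return []
    last = len(episode) - 1
    return [
        replace(t, instruction=new_instruction, reward=1.0 if i == last else 0.0)
        for i, t in enumerate(episode)
    ]


def toy_describer(state: str) -> Optional[str]:
    # Hypothetical stand-in: in this toy, states are already object names.
    return f"go to the {state}" if state else None


# A failed episode: the agent was told to reach the red ball but ended at the blue key.
episode = [
    Transition("start", 0, 0.0, "hall", "go to the red ball"),
    Transition("hall", 2, 0.0, "blue key", "go to the red ball"),
]
replay_buffer = episode + hindsight_relabel(episode, toy_describer)
```

Here the failed episode still contributes a rewarded trajectory ("go to the blue key") to the replay buffer. The quality of that signal depends entirely on the describer, which is why removing the need for an oracle matters.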