Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, taking advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show that selectivity in the social peer's feedback is important; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences in which the agent's competence is intermediate leads to significant improvements in final performance.
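To make two of these findings concrete, the following is a minimal illustrative sketch, not the implementation used in this work. It assumes a replay buffer that first samples a goal uniformly and then a trajectory for that goal (so rare goals are over-sampled relative to their frequency), and a goal-sampling weight c(1 - c) that peaks when the estimated success rate c is intermediate. The class `GoalBalancedReplay`, the function `intermediate_competence_weights`, and the weighting formula are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict


class GoalBalancedReplay:
    """Hypothetical replay buffer: sample a goal first, then a trajectory for it."""

    def __init__(self, seed=0):
        self._by_goal = defaultdict(list)  # goal string -> stored trajectories
        self._rng = random.Random(seed)

    def add(self, goal, trajectory):
        self._by_goal[goal].append(trajectory)

    def sample(self, batch_size):
        # Uniform over goals, not over trajectories: a goal seen once is
        # drawn as often as a goal seen a thousand times.
        goals = list(self._by_goal)
        batch = []
        for _ in range(batch_size):
            goal = self._rng.choice(goals)
            batch.append((goal, self._rng.choice(self._by_goal[goal])))
        return batch


def intermediate_competence_weights(success_rates):
    """Assumed weighting: c * (1 - c), largest at c = 0.5, zero at c = 0 or 1."""
    return {goal: c * (1.0 - c) for goal, c in success_rates.items()}
```

With 99 stored trajectories for a common goal and a single one for a rare goal, uniform sampling over trajectories would draw the rare goal about 1% of the time, whereas this sketch draws it about half the time. Other balancing schemes (e.g. inverse-frequency weights) would serve the same purpose.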