Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work has identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined so as to take advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show that selectivity in the social peer's feedback is important; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences in which the agent's competence is intermediate leads to significant improvements in final performance.