Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, thereby self-organizing a learning curriculum. Recent work has identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of the form of hindsight feedback given by a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, and how can easier goals be used as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show that selectivity in the social peer's feedback is important; that experience replay needs to over-sample examples of rare goals; and that following self-generated sequences of goals on which the agent's competence is intermediate leads to significant improvements in final performance.
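As an illustration of two of the findings above, the following minimal Python sketch shows one way rare goals could be over-sampled in experience replay and one way goals of intermediate competence could be favoured during goal selection. It is not the method used in this work: the class `GoalBalancedReplay`, the functions `intermediate_competence_weights` and `sample_goal`, the goal-uniform sampling scheme, the `p * (1 - p)` weighting, and the example goal strings are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class GoalBalancedReplay:
    """Hypothetical replay buffer: pick a goal uniformly, then one of its trajectories.

    Sampling goals uniformly (rather than trajectories uniformly) means a goal
    seen once is drawn as often as a goal seen thousands of times, which
    over-samples rare goals relative to their frequency in the buffer.
    """

    def __init__(self, seed=0):
        self._by_goal = defaultdict(list)
        self._rng = random.Random(seed)

    def add(self, goal, trajectory):
        self._by_goal[goal].append(trajectory)

    def sample(self, batch_size):
        goals = list(self._by_goal)
        batch = []
        for _ in range(batch_size):
            goal = self._rng.choice(goals)  # uniform over distinct goals
            batch.append((goal, self._rng.choice(self._by_goal[goal])))
        return batch


def intermediate_competence_weights(success_rates):
    """Assumed weighting: p * (1 - p) peaks at p = 0.5 and vanishes at 0 and 1."""
    return {goal: p * (1.0 - p) for goal, p in success_rates.items()}


def sample_goal(success_rates, rng, eps=0.05):
    """Sample a goal, favouring intermediate competence.

    eps keeps a small probability on goals that are fully mastered or never
    achieved, so they are not excluded from selection entirely.
    """
    weights = intermediate_competence_weights(success_rates)
    goals = list(weights)
    return rng.choices(goals, weights=[weights[g] + eps for g in goals], k=1)[0]


# Usage with hypothetical goal strings and success rates.
buffer = GoalBalancedReplay(seed=0)
for i in range(99):
    buffer.add("open door", f"common-trajectory-{i}")
buffer.add("grow plant", "rare-trajectory-0")
batch = buffer.sample(1000)
rare_fraction = sum(goal == "grow plant" for goal, _ in batch) / len(batch)

rates = {"open door": 0.95, "heat water": 0.5, "grow plant": 0.02}
next_goal = sample_goal(rates, random.Random(0))
```

In this sketch the rare goal makes up 1% of the stored trajectories but roughly half of each sampled batch. Other over-sampling schemes (for example, weights inversely proportional to goal frequency) and other competence-based criteria would fit the same interface.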