What data or environments to use for training to improve downstream performance is a long-standing and highly topical question in reinforcement learning. In particular, Unsupervised Environment Design (UED) methods have gained recent attention, as their adaptive curricula promise to make agents robust to in- and out-of-distribution tasks. This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics. Surprisingly, despite these methods aiming to maximise regret in theory, their practical approximations correlate not with regret but with success rate. As a result, a significant portion of an agent's experience comes from environments it has already mastered, contributing little or nothing to improving its abilities. Put differently, current methods fail to predict intuitive measures of ``learnability'': they cannot consistently identify the scenarios an agent can sometimes solve, but not always. Based on our analysis, we develop a method that directly trains on scenarios with high learnability. This simple and intuitive approach outperforms existing UED methods in several binary-outcome environments, including the standard Minigrid domain and a novel setting closely inspired by a real-world robotics problem. We further introduce a new adversarial evaluation procedure for directly measuring robustness, closely mirroring the conditional value at risk (CVaR). We open-source all our code and present visualisations of final policies here: https://github.com/amacrutherford/sampling-for-learnability.
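The notion of ``learnability'' above, i.e. scenarios the agent can sometimes solve but not always, admits a natural formalisation: score each level by $p(1-p)$, where $p$ is the agent's empirical success rate on that level, and sample training levels in proportion to this score. The sketch below illustrates this idea; the function names and the uniform fallback are our own illustrative choices, not necessarily the paper's exact implementation.

```python
import numpy as np

def learnability(success_rate):
    # p * (1 - p): peaks at p = 0.5, where outcomes are most uncertain;
    # mastered (p = 1) and impossible (p = 0) levels score zero.
    p = np.asarray(success_rate, dtype=float)
    return p * (1.0 - p)

def sampling_probs(success_rates):
    # Normalise learnability scores into a sampling distribution
    # over the candidate training levels.
    scores = learnability(success_rates)
    total = scores.sum()
    if total == 0.0:
        # No level is informative; fall back to uniform sampling.
        return np.full(len(scores), 1.0 / len(scores))
    return scores / total

rates = [0.0, 0.25, 0.5, 0.9, 1.0]
probs = sampling_probs(rates)
# The half-solved level (p = 0.5) receives the most weight, while
# the mastered and impossible levels receive none.
```

In a curriculum loop, one would draw the next training level with, e.g., `np.random.default_rng().choice(len(rates), p=probs)` and update each level's success-rate estimate from rollout outcomes.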