Enabling Adaptive Agent Training in Open-Ended Simulators by Targeting Diversity

The wider application of end-to-end learning methods to embodied decision-making domains remains bottlenecked by their reliance on a superabundance of training data representative of the target domain. Meta-reinforcement learning (meta-RL) approaches abandon the aim of zero-shot generalization--the goal of standard reinforcement learning (RL)--in favor of few-shot adaptation, and thus hold promise for bridging larger generalization gaps. While learning this meta-level adaptive behavior still requires substantial data, efficient environment simulators approaching real-world complexity are growing in prevalence. Even so, hand-designing sufficiently diverse and numerous simulated training tasks for these complex domains is prohibitively labor-intensive. Domain randomization (DR) and procedural generation (PG), offered as solutions to this problem, require simulators to possess carefully defined parameters which directly translate to meaningful task diversity--a similarly prohibitive assumption. In this work, we present DIVA, an evolutionary approach for generating diverse training tasks in such complex, open-ended simulators. Like unsupervised environment design (UED) methods, DIVA can be applied to arbitrary parameterizations, but can additionally incorporate realistically available domain knowledge--thus inheriting the flexibility and generality of UED, and the supervised structure embedded in well-designed simulators exploited by DR and PG. Our empirical results showcase DIVA's unique ability to overcome complex parameterizations and successfully train adaptive agent behavior, far outperforming competitive baselines from prior literature. These findings highlight the potential of such semi-supervised environment design (SSED) approaches, of which DIVA is the first humble constituent, to enable training in realistic simulated domains, and produce more robust and capable adaptive agents.
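
To make the contrast between independent parameter sampling and a diversity-targeting evolutionary search concrete, here is a minimal toy sketch. It is not DIVA's algorithm: it is a generic archive-based evolutionary loop, and every name in it (`N_CELLS`, `wall_count`, `sample_uniform`, `evolve_archive`) is a hypothetical chosen for illustration. The assumed setup is a task encoded as a bit vector (one bit per grid cell, 1 meaning wall), with "number of walls" standing in for a feature a designer might care about. Sampling each bit independently concentrates tasks around half-filled grids, whereas keeping one task per feature value and mutating archive members spreads coverage across the feature range.

```python
import random

N_CELLS = 30  # hypothetical genotype length: one bit per grid cell (1 = wall)


def wall_count(genotype):
    """Hypothetical domain-knowledge feature: number of wall cells."""
    return sum(genotype)


def random_genotype(rng):
    """Sample every parameter (bit) independently and uniformly."""
    return [rng.randint(0, 1) for _ in range(N_CELLS)]


def mutate(genotype, rng):
    """Return a copy of the genotype with one randomly chosen bit flipped."""
    child = list(genotype)
    i = rng.randrange(N_CELLS)
    child[i] = 1 - child[i]
    return child


def sample_uniform(n_samples, seed=0):
    """Randomization-style baseline: return the set of feature values reached."""
    rng = random.Random(seed)
    return {wall_count(random_genotype(rng)) for _ in range(n_samples)}


def evolve_archive(n_iters, seed=0):
    """Archive-based search: keep one genotype per feature value and
    mutate randomly chosen archive members to discover new feature values."""
    rng = random.Random(seed)
    first = random_genotype(rng)
    archive = {wall_count(first): first}
    for _ in range(n_iters):
        parent = archive[rng.choice(sorted(archive))]
        child = mutate(parent, rng)
        # Only fill feature values that have no representative yet.
        archive.setdefault(wall_count(child), child)
    return archive


if __name__ == "__main__":
    budget = 20000
    print("feature values covered by uniform sampling:", len(sample_uniform(budget)))
    print("feature values covered by archive search:", len(evolve_archive(budget)))
```

With the same sampling budget, the archive search is expected to reach the extreme feature values (nearly empty or nearly full grids) that independent sampling almost never produces. The toy omits any notion of task quality or agent training; it only illustrates why targeting diversity in a feature space can differ from randomizing raw parameters.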
