Unsupervised pre-training can equip reinforcement learning agents with prior knowledge and accelerate learning in downstream tasks. A promising direction, grounded in human development, investigates agents that learn by setting and pursuing their own goals. The core challenge lies in how to effectively generate, select, and learn from such goals. Our focus is on broad distributions of downstream tasks where solving every task zero-shot is infeasible. Such settings naturally arise when the target tasks lie outside the pre-training distribution or when their identities are unknown to the agent. In this work, we (i) optimize for efficient multi-episode exploration and adaptation within a meta-learning framework, and (ii) guide the training curriculum with evolving estimates of the agent's post-adaptation performance. We present ULEE, an unsupervised meta-learning method that combines an in-context learner with an adversarial goal-generation strategy, keeping training at the frontier of the agent's capabilities. On XLand-MiniGrid benchmarks, ULEE pre-training yields improved exploration and adaptation abilities that generalize to novel objectives, environment dynamics, and map structures. The resulting policy attains improved zero-shot and few-shot performance, and provides a strong initialization for longer fine-tuning. It outperforms learning from scratch, DIAYN pre-training, and alternative curricula.