There has been a recent surge of interest in developing generally capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach: it enables policies to be trained for new tasks using imagined experience. However, achieving a general agent requires robustness across different environments. In this work, we address the novel problem of generating curricula in the reward-free setting to train robust world models. We consider robustness in terms of minimax regret over all environment instantiations, and show that the minimax regret can be connected to minimising the maximum error in the world model across environment instances. This result informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms several baselines, resulting in improved robustness, efficiency, and generalisation.
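To illustrate the idea of selecting environments according to estimated world-model error, the following is a minimal sketch and not the WAKER implementation. The abstract does not specify how the error is estimated or how selection probabilities are formed, so everything here is an assumption made for illustration: the function name `sample_environment`, the softmax (Boltzmann) weighting, the `temperature` and `uniform_prob` parameters, and the toy error values.

```python
import math
import random


def sample_environment(error_estimates, temperature=1.0, uniform_prob=0.2, rng=random):
    """Pick an environment id, favouring higher estimated world-model error.

    error_estimates: dict mapping environment id -> estimated model error
                     (how this estimate is obtained is outside this sketch).
    temperature:     hypothetical softmax temperature; lower is greedier.
    uniform_prob:    hypothetical chance of sampling uniformly, so that
                     error estimates for every environment can stay fresh.
    """
    env_ids = list(error_estimates)
    if rng.random() < uniform_prob:
        return rng.choice(env_ids)

    # Softmax over error estimates; subtract the max for numerical stability.
    max_err = max(error_estimates.values())
    weights = [
        math.exp((error_estimates[e] - max_err) / temperature) for e in env_ids
    ]
    return rng.choices(env_ids, weights=weights, k=1)[0]


if __name__ == "__main__":
    # Toy error estimates for three hypothetical environment instances.
    errors = {"flat": 0.1, "gaps": 0.5, "stairs": 0.9}
    rng = random.Random(0)
    counts = {e: 0 for e in errors}
    for _ in range(5000):
        counts[sample_environment(errors, temperature=0.2, rng=rng)] += 1
    print(counts)
```

In this sketch, environments whose estimated error is higher are sampled more often for data collection, which is one simple way to push down the maximum error across environment instances; the uniform-sampling fraction is a design choice that keeps low-error environments from being ignored entirely.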