There has been a recent surge of interest in developing generally capable agents that can adapt to new tasks without additional training in the environment. Learning world models from reward-free exploration is a promising approach, as it enables policies to be trained using imagined experience for new tasks. Achieving a general agent requires robustness across different environments. However, different environments may require different amounts of data to learn a suitable world model. In this work, we address the problem of efficiently learning robust world models in the reward-free setting. As a measure of robustness, we consider the minimax regret objective. We show that the minimax regret objective can be connected to minimising the maximum error in the world model across environments. This informs our algorithm, WAKER: Weighted Acquisition of Knowledge across Environments for Robustness. WAKER selects environments for data collection based on the estimated error of the world model for each environment. Our experiments demonstrate that WAKER outperforms naive domain randomisation, resulting in improved robustness, efficiency, and generalisation.
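To make the idea of error-weighted environment selection concrete, the following is a minimal sketch and not the authors' implementation. It assumes a softmax over per-environment error estimates blended with uniform sampling, which stands in for naive domain randomisation. The names `sampling_weights` and `choose_environment` and the parameters `temperature` and `uniform_mix` are hypothetical, as are the error values in the usage note.

```python
import math
import random


def sampling_weights(errors, temperature=1.0, uniform_mix=0.2):
    """Map per-environment error estimates to sampling probabilities.

    errors: dict of environment id -> estimated world-model error.
    temperature, uniform_mix: illustrative knobs (assumptions, not from the paper).
    """
    if not errors:
        raise ValueError("need at least one environment")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Softmax over error estimates, shifted by the maximum for numerical stability.
    top = max(errors.values())
    scores = {env: math.exp((err - top) / temperature) for env, err in errors.items()}
    total = sum(scores.values())
    uniform = 1.0 / len(errors)
    # Blend with a uniform distribution so every environment keeps being visited.
    return {
        env: uniform_mix * uniform + (1.0 - uniform_mix) * score / total
        for env, score in scores.items()
    }


def choose_environment(errors, rng=random, **kwargs):
    """Sample the next environment in which to collect data."""
    weights = sampling_weights(errors, **kwargs)
    envs = list(weights)
    return rng.choices(envs, weights=[weights[e] for e in envs], k=1)[0]
```

For example, `choose_environment({"flat": 0.1, "stairs": 0.9, "gaps": 0.4})` would return `"stairs"` more often than the other two, directing exploration towards the environment where the model is currently estimated to be least accurate. Setting `uniform_mix=1.0` recovers plain uniform sampling over environments.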