Offline goal-conditioned RL (GCRL) offers a way to train general-purpose agents from fully offline datasets. Beyond remaining conservative within the dataset, offline GCRL faces a second fundamental challenge: generalizing to goals not seen during training. To the best of our knowledge, however, this problem has not yet been well studied. In this paper, we study out-of-distribution (OOD) generalization of offline GCRL both theoretically and empirically to identify the factors that matter. Across a number of experiments, we observe that weighted imitation learning generalizes better than pessimism-based offline RL methods. Based on this insight, we derive a theory for OOD generalization that characterizes several important design choices. Combining the findings from our theoretical and empirical studies, we then propose a new offline GCRL method, Generalizable Offline goAl-condiTioned RL (GOAT). On a new benchmark containing 9 independent and identically distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current state-of-the-art methods by a large margin.