Offline goal-conditioned RL (GCRL) offers a way to train general-purpose agents from fully offline datasets. Beyond remaining conservative within the dataset, offline GCRL faces a second fundamental challenge: generalizing to unseen goals. To the best of our knowledge, however, this problem has not yet been well studied. In this paper, we study out-of-distribution (OOD) generalization of offline GCRL both theoretically and empirically to identify the factors that matter. Across a number of experiments, we observe that weighted imitation learning generalizes better than pessimism-based offline RL methods. Based on this insight, we derive a theory of OOD generalization that characterizes several important design choices. We then propose a new offline GCRL method, Generalizable Offline goAl-condiTioned RL (GOAT), by combining the findings from our theoretical and empirical studies. On a new benchmark containing 9 independent and identically distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current state-of-the-art methods by a large margin.