Offline Goal-Conditioned RL (GCRL) offers a feasible paradigm for learning general-purpose policies from diverse, multi-task offline datasets. Despite notable recent progress, the predominant offline GCRL methods, which are mainly model-free, face constraints in handling limited data and generalizing to unseen goals. In this work, we propose Goal-conditioned Offline Planning (GOPlan), a novel model-based framework that comprises two key phases: (1) pretraining a prior policy capable of capturing the multi-modal action distributions within the multi-goal dataset; (2) employing the reanalysis method with planning to generate imagined trajectories for finetuning policies. Specifically, we base the prior policy on an advantage-weighted conditioned generative adversarial network, which facilitates distinct mode separation and mitigates the pitfalls of out-of-distribution (OOD) actions. For further policy optimization, the reanalysis method generates high-quality imagined data by planning with learned models for both intra-trajectory and inter-trajectory goals. Through thorough experimental evaluations, we demonstrate that GOPlan achieves state-of-the-art performance on various offline multi-goal navigation and manipulation tasks. Moreover, our results highlight the superior ability of GOPlan to handle small data budgets and to generalize to OOD goals.
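As context for the advantage-weighted prior policy, the sketch below illustrates the standard exponentiated-advantage weighting scheme that such methods typically use to emphasize high-advantage dataset actions during generative pretraining. The function name and the temperature `beta` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def advantage_weights(advantages, beta=1.0):
    # Exponentiated-advantage weighting: actions with higher estimated
    # advantage receive proportionally larger weight when training the
    # generative prior; beta controls how sharply weights concentrate.
    w = np.exp(np.asarray(advantages, dtype=float) / beta)
    return w / w.sum()  # normalize over the batch

# Example: the action with the highest advantage gets the largest weight.
w = advantage_weights([0.0, 1.0, 2.0], beta=1.0)
```

In practice these weights would multiply the per-sample loss of the conditioned generator, biasing the learned action distribution toward in-distribution, high-advantage modes.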