Offline goal-conditioned reinforcement learning (GCRL) is a promising approach for pretraining generalist policies on large datasets of reward-free trajectories, akin to the self-supervised objectives used to train foundation models for computer vision and natural language processing. However, scaling GCRL to longer horizons remains challenging due to the combination of sparse rewards and discounting, which obscures the comparative advantages of primitive actions with respect to distant goals. Hierarchical RL methods achieve strong empirical results on long-horizon goal-reaching tasks, but their reliance on modular, timescale-specific policies and subgoal generation introduces significant additional complexity and hinders scaling to high-dimensional goal spaces. In this work, we introduce an algorithm to train a flat (non-hierarchical) goal-conditioned policy by bootstrapping on subgoal-conditioned policies with advantage-weighted importance sampling. Our approach eliminates the need for a generative model over the (sub)goal space, which we find is key for scaling to high-dimensional control in large state spaces. We further show that existing hierarchical and bootstrapping-based approaches correspond to specific design choices within our derivation. Across a comprehensive suite of state- and pixel-based locomotion and manipulation benchmarks, our method matches or surpasses state-of-the-art offline GCRL algorithms and scales to complex, long-horizon tasks where prior approaches fail. Project page: https://johnlyzhou.github.io/saw/