In policy learning, stitching and compositional generalization refer to the extent to which a policy can piece together sub-trajectories from its training data to generate new and diverse behaviours. While stitching has been identified as a significant strength of offline reinforcement learning, recent generative behavioural cloning (BC) methods have also shown proficiency at stitching. However, the main factors behind this are poorly understood, which hinders the development of new algorithms that can reliably stitch by design. Focusing on diffusion planners trained via generative behavioural cloning, without resorting to dynamic programming or TD-learning, we find that three properties are key enablers of composition: shift equivariance, local receptive fields, and inference choices. We use these properties to explain architecture, data, and inference choices in existing generative BC methods based on diffusion planning, including replanning frequency, data augmentation, and data scaling. Our experiments show that, while local receptive fields are more important than shift equivariance in creating a diffusion planner capable of composition, both are crucial. Using findings from our experiments, we develop a new architecture for diffusion planners called Eq-Net that is simple, produces diverse trajectories competitive with more computationally expensive methods such as replanning or scaling data, and can be guided to enable generalization in goal-conditioned settings. We show that Eq-Net exhibits significant compositional generalization in a variety of navigation and manipulation tasks designed to test planning diversity.
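To make two of the named properties concrete, the following is a minimal sketch, not the Eq-Net architecture itself. It assumes a single 1D convolution over the time axis of a scalar trajectory, with circular padding so that shift equivariance holds exactly at the boundaries; the function name `local_conv`, the kernel values, and the trajectory length are all hypothetical choices for illustration. The sketch checks that shifting the input in time shifts the output by the same amount (shift equivariance), and that perturbing one timestep changes only the outputs within the kernel radius (local receptive field).

```python
import numpy as np


def local_conv(traj, kernel):
    """Circular 1D convolution along the time axis of a (T,) signal.

    Hypothetical stand-in for one layer of a planner's denoising network;
    circular padding is an assumption made to keep equivariance exact.
    """
    T, k = len(traj), len(kernel)
    r = k // 2  # kernel radius: each output sees inputs t-r .. t+r
    out = np.zeros(T)
    for t in range(T):
        for j in range(k):
            out[t] += kernel[j] * traj[(t + j - r) % T]
    return out


rng = np.random.default_rng(0)
traj = rng.normal(size=16)             # toy scalar trajectory, T = 16
kernel = np.array([0.25, 0.5, 0.25])   # width-3 kernel, radius 1

# Shift equivariance: convolving a shifted input equals shifting the output.
shift = 3
lhs = local_conv(np.roll(traj, shift), kernel)
rhs = np.roll(local_conv(traj, kernel), shift)
shift_equivariant = np.allclose(lhs, rhs)

# Local receptive field: a change at timestep 8 only reaches outputs 7, 8, 9.
perturbed = traj.copy()
perturbed[8] += 1.0
changed = np.flatnonzero(
    ~np.isclose(local_conv(perturbed, kernel), local_conv(traj, kernel))
)

print("shift equivariant:", shift_equivariant)
print("outputs affected by perturbing t=8:", changed.tolist())
```

With zero padding instead of circular padding, the same check would hold only for timesteps away from the trajectory boundaries. Stacking several such layers would widen the receptive field by one kernel radius per layer.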