Deep generative models have recently emerged as an effective approach to offline reinforcement learning, but their large model size makes them computationally expensive. We address this issue with a knowledge distillation method based on data augmentation. Specifically, we generate high-return trajectories from a conditional diffusion model and blend them with the original trajectories through a novel stitching algorithm that leverages a new reward generator. When behavioral cloning is applied to the resulting dataset, the learned shallow policy, despite being much smaller, outperforms or nearly matches deep generative planners on several D4RL benchmarks.
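To make the final distillation step concrete, the following is a minimal toy sketch of behavioral cloning on an augmented dataset. It is not the paper's implementation: the "generated" transitions here come from the same hand-written toy rule as the "original" ones rather than from a conditional diffusion model, the stitching algorithm and reward generator are omitted because the abstract does not specify them, and the shallow policy is reduced to a single linear layer. All names (`behavioral_cloning`, `act`, `toy_expert`, `sample_transitions`) are hypothetical.

```python
import random


def behavioral_cloning(dataset, state_dim, action_dim, lr=0.05, epochs=200, seed=0):
    """Fit a linear policy a = W s + b to (state, action) pairs with per-sample SGD
    on the squared error. This stands in for a shallow policy network."""
    rng = random.Random(seed)
    W = [[0.0] * state_dim for _ in range(action_dim)]
    b = [0.0] * action_dim
    data = list(dataset)
    for _ in range(epochs):
        rng.shuffle(data)
        for s, a in data:
            for i in range(action_dim):
                pred = sum(W[i][j] * s[j] for j in range(state_dim)) + b[i]
                err = pred - a[i]
                # Gradient step on 0.5 * err**2 for output dimension i.
                for j in range(state_dim):
                    W[i][j] -= lr * err * s[j]
                b[i] -= lr * err
    return W, b


def act(policy, s):
    """Evaluate the linear policy on a single state."""
    W, b = policy
    return [sum(w * x for w, x in zip(row, s)) + bi for row, bi in zip(W, b)]


def toy_expert(s):
    # Assumption: a hand-written rule standing in for high-return behavior.
    return [2.0 * s[0] - 1.0 * s[1] + 0.5]


def sample_transitions(n, seed):
    """Draw n toy (state, action) pairs with states uniform in [-1, 1]^2."""
    rng = random.Random(seed)
    states = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(n)]
    return [(s, toy_expert(s)) for s in states]


original = sample_transitions(50, seed=1)   # stands in for the offline dataset
generated = sample_transitions(50, seed=2)  # placeholder for diffusion-generated data
augmented = original + generated            # simple concatenation, not the paper's stitching

policy = behavioral_cloning(augmented, state_dim=2, action_dim=1)
action = act(policy, [0.5, -0.5])
```

In this toy setting the cloned policy recovers the rule that produced the data, so `action` is close to `toy_expert([0.5, -0.5])`. The point of the sketch is only that the student is trained by plain supervised regression on the augmented data, so its inference cost depends on the small policy rather than on the generative model.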