This work introduces Variational Diffusion Distillation (VDD), a novel method that distills denoising diffusion policies into Mixtures of Experts (MoE) through variational inference. Diffusion models are the current state of the art in generative modeling because they can accurately learn and represent complex, multi-modal distributions. This ability allows them to replicate the inherent diversity of human behavior, making them the preferred models in behavior learning settings such as Learning from Human Demonstrations (LfD). However, diffusion models have drawbacks, including intractable likelihoods and long inference times caused by their iterative sampling process. The inference times in particular pose a significant challenge for real-time applications such as robot control. In contrast, MoEs offer tractable likelihoods and fast sampling while retaining the ability to represent complex distributions, but they are notoriously difficult to train. VDD is the first method that distills pre-trained diffusion models into MoE models and hence combines the expressiveness of diffusion models with the benefits of mixture models. Specifically, VDD leverages a decompositional upper bound of the variational objective that allows each expert to be trained separately, resulting in a robust optimization scheme for MoEs. Across nine complex behavior learning tasks, VDD is shown to: i) accurately distill complex distributions learned by the diffusion model, ii) outperform existing state-of-the-art distillation methods, and iii) surpass conventional methods for training MoEs.
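To make the contrast with iterative diffusion sampling concrete, the following is a minimal sketch of why a Gaussian mixture of experts has a tractable likelihood and single-pass sampling. It is an illustration only, not the VDD training procedure: the function names `moe_log_likelihood` and `moe_sample` are hypothetical, the action is one-dimensional, and the gating weights and expert parameters are fixed constants here, whereas in a policy they would presumably depend on the observed state.

```python
import math
import random


def moe_log_likelihood(action, weights, means, stds):
    """Exact log p(a) = log sum_k w_k * N(a | mu_k, sigma_k^2) for a 1-D mixture."""
    log_terms = []
    for w, mu, sigma in zip(weights, means, stds):
        # Log-density of one Gaussian expert at `action`.
        log_gauss = (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
                     - 0.5 * ((action - mu) / sigma) ** 2)
        log_terms.append(math.log(w) + log_gauss)
    # Log-sum-exp over experts for numerical stability.
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))


def moe_sample(weights, means, stds, rng):
    """Draw one action: pick an expert by its gating weight, then sample from it."""
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return rng.gauss(means[k], stds[k])


# Hypothetical bimodal behavior: two equally likely actions near -1 and +1.
weights = [0.5, 0.5]
means = [-1.0, 1.0]
stds = [0.1, 0.1]

rng = random.Random(0)
samples = [moe_sample(weights, means, stds, rng) for _ in range(1000)]
log_p_mode = moe_log_likelihood(1.0, weights, means, stds)
log_p_between = moe_log_likelihood(0.0, weights, means, stds)
```

Sampling needs one categorical draw and one Gaussian draw, with no iterative denoising loop, and the likelihood is a closed-form sum over experts. Both modes remain represented, which is the multi-modality property the text attributes to MoEs.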