Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior. As models grow larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws. Continuing with current architectures will therefore present a computational roadblock. To address this gap, we propose Mixture-of-Denoising Experts (MoDE) as a novel policy for Imitation Learning. MoDE surpasses current state-of-the-art Transformer-based Diffusion Policies while enabling parameter-efficient scaling through sparse experts and noise-conditioned routing, reducing active parameters by 40% and inference costs by 90% via expert caching. Our architecture combines this efficient scaling with a noise-conditioned self-attention mechanism, enabling more effective denoising across different noise levels. MoDE achieves state-of-the-art performance on 134 tasks across four established imitation learning benchmarks (CALVIN and LIBERO). Notably, by pretraining MoDE on diverse robotics data, we achieve 4.01 on CALVIN ABC and 0.95 on LIBERO-90. It surpasses both CNN-based and Transformer Diffusion Policies by an average of 57% across the four benchmarks, while using 90% fewer FLOPs and fewer active parameters compared to default Diffusion Transformer architectures. Furthermore, we conduct comprehensive ablations on MoDE's components, providing insights for designing efficient and scalable Transformer architectures for Diffusion Policies. Code and demonstrations are available at https://mbreuss.github.io/MoDE_Diffusion_Policy/.
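The key efficiency idea above is that expert routing depends only on the noise level of the current denoising step, not on the token content, so the expert assignment for every step of a fixed noise schedule can be computed once and cached before inference. The following is a minimal sketch of that idea, not MoDE's actual implementation: the sinusoidal noise embedding, the linear router, and all names (`NoiseConditionedRouter`, `precompute_routes`) are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

class NoiseConditionedRouter:
    """Top-k expert router conditioned only on the diffusion noise level.

    Because the routing input is the noise level (not token content),
    the chosen experts are identical for every token at a given step,
    and the routes for a whole noise schedule can be precomputed.
    Illustrative sketch; not the paper's implementation.
    """

    def __init__(self, num_experts=4, top_k=2, embed_dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(size=(embed_dim, num_experts))  # router weights
        self.top_k = top_k
        self.embed_dim = embed_dim

    def noise_embedding(self, sigma):
        # sinusoidal embedding of the log noise level (a common choice)
        freqs = np.exp(np.linspace(0.0, 3.0, self.embed_dim // 2))
        x = np.log(sigma) * freqs
        return np.concatenate([np.sin(x), np.cos(x)])

    def route(self, sigma):
        # score experts from the noise embedding, keep the top-k,
        # and renormalize their gate weights to sum to 1
        logits = self.noise_embedding(sigma) @ self.w
        topk = np.argsort(logits)[-self.top_k:]
        gates = softmax(logits[topk])
        return topk, gates

def precompute_routes(router, noise_schedule):
    """Cache expert choices for every noise level in the schedule."""
    return {sigma: router.route(sigma) for sigma in noise_schedule}
```

Since the cache maps each noise level to a fixed set of experts, the unused experts at a given step never need to be loaded or executed, which is the mechanism behind the claimed reduction in active parameters and inference cost.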