Diffusion Policy is a powerful technique for learning end-to-end visuomotor robot control. Diffusion Policy is expected to be scalable, a key attribute of deep neural networks, whereby increasing model size typically yields better performance. However, our observations indicate that the transformer-based Diffusion Policy (\DP) struggles to scale effectively; even adding a few layers can deteriorate training outcomes. To address this issue, we introduce a Scalable Diffusion Transformer Policy for visuomotor learning. Our proposed method, \textbf{\methodname}, introduces two modules that improve the training dynamics of Diffusion Policy and allow the network to better handle multimodal action distributions. First, we identify that \DP~suffers from large gradients, which make the optimization of Diffusion Policy unstable. To resolve this issue, we factorize the observation feature embedding into multiple affine layers and integrate them into the transformer blocks. Second, we utilize non-causal attention, which allows the policy network to \enquote{see} future actions during prediction, helping to reduce compounding errors. We demonstrate that our method successfully scales Diffusion Policy from 10 million to 1 billion parameters, with performance and generalization improving as model size grows. We benchmark \methodname~across 50 tasks from MetaWorld and find that our largest \methodname~outperforms \DP~with an average improvement of 21.6\%. Across 7 real-world robot tasks, \methodname~achieves an average improvement of 36.25\% over \DP~on four single-arm tasks and 75\% on three bimanual tasks. We believe our work paves the way for scaling up models for visuomotor learning. The project page is available at scaling-diffusion-policy.github.io.
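The first module factorizes the observation embedding into per-block affine (shift/scale/gate) layers that modulate the transformer features. A minimal NumPy sketch of this style of conditioning is shown below; the variable names, dimensions, and the single linear map producing the affine parameters are illustrative assumptions, not the authors' exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8
obs_emb = rng.normal(size=(2, dim))      # pooled observation embedding (batch of 2)
x = rng.normal(size=(2, 5, dim))         # action-token features inside a block

# Hypothetical per-block affine layer: maps the observation embedding
# to shift, scale, and gate vectors (one linear map, split in three).
W = rng.normal(size=(dim, 3 * dim)) * 0.02
shift, scale, gate = np.split(obs_emb @ W, 3, axis=-1)

# Normalize the token features, then modulate them with the
# observation-conditioned affine parameters (adaLN-style modulation).
h = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-6)
h = h * (1 + scale[:, None, :]) + shift[:, None, :]

# Gated residual update; the gate lets conditioning start near-identity,
# which is one common way such affine factorization stabilizes training.
out = x + gate[:, None, :] * h
```

In a full transformer block this modulation would wrap both the attention and MLP sub-layers, and the attention would use no causal mask, so each action token can attend to future actions in the predicted chunk.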