Ambient Diffusion Policy: Imitation Learning from Suboptimal Data in Robotics

We propose Ambient Diffusion Policy, a simple and principled method for imitation learning from suboptimal data in robotics. High-quality, task-specific robot data is expensive and time-consuming to collect, while suboptimal datasets with lower-quality or out-of-distribution demonstrations are abundant. Existing methods that co-train on both data sources often fail to separate the meaningful features from the harmful ones in the suboptimal samples. In contrast, our method extracts only the useful features by introducing a new axis to co-training in robotics: noise-dependent data usage. Ambient Diffusion Policy restricts the contribution of suboptimal data during training to only the high and low diffusion times. To rigorously justify our approach, we first observe that robot action data exhibits a spectral power law. This gives the optimal Diffusion Policy two important properties that we exploit: a global-to-local hierarchy and locality. We formalize this argument theoretically using a simplified model. Our experiments validate Ambient Diffusion Policy on four types of suboptimal action data (noisy trajectories, sim-to-real gap, task mismatch, and large-scale data mixtures) across six tasks. The results show that it effectively learns from arbitrary sources of suboptimal data. Notably, it outperforms existing co-training baselines by up to 33% when scaled to Open X-Embodiment, a large dataset with heterogeneous data quality and unstructured distribution shifts. Overall, Ambient Diffusion Policy increases the utility of suboptimal demonstrations and expands the set of usable data sources in robotics.
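
To make the idea of noise-dependent data usage concrete, the following is a minimal sketch, not the authors' implementation. It assumes diffusion times normalized to [0, 1] with larger values meaning more noise, a standard noise-prediction loss, and two hypothetical thresholds `t_low` and `t_high`. The function names and threshold values are illustrative assumptions, and the loss actually used in the paper may differ. High-quality samples contribute at every diffusion time, while suboptimal samples contribute only in the low and high bands.

```python
import numpy as np


def usable_mask(t, is_clean, t_low=0.2, t_high=0.6):
    """Boolean mask of samples allowed to contribute to the loss.

    t        : diffusion times in [0, 1]; larger means more noise (assumed).
    is_clean : True for high-quality, in-distribution samples.
    t_low, t_high : hypothetical thresholds; suboptimal samples are used
                    only when t <= t_low or t >= t_high.
    """
    t = np.asarray(t, dtype=float)
    is_clean = np.asarray(is_clean, dtype=bool)
    in_allowed_band = (t <= t_low) | (t >= t_high)
    # Clean data is always usable; suboptimal data only in the allowed bands.
    return is_clean | in_allowed_band


def masked_denoising_loss(pred_noise, true_noise, t, is_clean,
                          t_low=0.2, t_high=0.6):
    """Mean squared noise-prediction error over the usable samples only."""
    pred_noise = np.asarray(pred_noise, dtype=float)
    true_noise = np.asarray(true_noise, dtype=float)
    # Average the squared error over all non-batch dimensions.
    feature_axes = tuple(range(1, pred_noise.ndim))
    per_sample = np.mean((pred_noise - true_noise) ** 2, axis=feature_axes)
    mask = usable_mask(t, is_clean, t_low, t_high).astype(float)
    # Guard against an empty batch of usable samples.
    return float(np.sum(mask * per_sample) / max(np.sum(mask), 1.0))
```

In this sketch, a suboptimal sample drawn at an intermediate diffusion time is simply dropped from the batch loss. How the thresholds should be chosen for a given data source is not specified here.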
