Given a dataset of expert trajectories, standard imitation learning approaches typically learn a direct mapping from observations (e.g., RGB images) to actions. However, such methods often overlook the rich interplay among modalities, i.e., sensory inputs, actions, and rewards, which is crucial for modeling robot behavior and understanding task outcomes. In this work, we propose Multimodal Diffusion Forcing (MDF), a unified framework for learning from multimodal robot trajectories that extends beyond action generation. Rather than modeling a single fixed conditional distribution (e.g., actions given observations), MDF applies random partial masking and trains a diffusion model to reconstruct the full trajectory. This training objective encourages the model to learn temporal and cross-modal dependencies, such as predicting the effects of actions on force signals or inferring states from partial observations. We evaluate MDF on contact-rich, forceful manipulation tasks in simulated and real-world environments. Our results show that MDF not only supports versatile functionalities but also achieves strong performance and robustness under noisy observations. More visualizations are available on our website: https://unified-df.github.io
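The training objective can be summarized as: randomly mask part of a multimodal trajectory, corrupt the masked tokens with diffusion noise, and train a denoiser to reconstruct them conditioned on the remaining clean tokens. The sketch below is a minimal illustration under simplifying assumptions (trajectories already encoded as per-timestep feature vectors per modality, a toy MLP denoiser, DDPM-style noising); the names `MDFDenoiser` and `training_step` are hypothetical and not the authors' implementation.

```python
# Minimal sketch of an MDF-style training step (illustrative, not the paper's code).
import torch
import torch.nn as nn

T, D_OBS, D_ACT, D_FORCE = 16, 64, 8, 6      # horizon and per-modality feature sizes (assumed)
D = D_OBS + D_ACT + D_FORCE                  # concatenated per-timestep token dimension
N_STEPS = 100                                # diffusion timesteps

class MDFDenoiser(nn.Module):
    """Stand-in denoiser: predicts the noise added to masked trajectory tokens."""
    def __init__(self, dim, steps):
        super().__init__()
        self.t_embed = nn.Embedding(steps, dim)
        self.net = nn.Sequential(nn.Linear(3 * dim + 1, 256), nn.ReLU(),
                                 nn.Linear(256, dim))

    def forward(self, noisy, context, mask, t):
        # Condition on the noisy tokens, the clean (unmasked) context, and the mask.
        te = self.t_embed(t)[:, None, :].expand_as(noisy)
        return self.net(torch.cat([noisy, context, mask, te], dim=-1))

def training_step(model, traj, betas):
    """One update: mask a random subset of tokens, noise them, reconstruct them."""
    B, horizon, dim = traj.shape
    mask = (torch.rand(B, horizon, 1) < 0.5).float()     # 1 = masked / to be reconstructed
    t = torch.randint(0, N_STEPS, (B,))
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t].view(B, 1, 1)
    eps = torch.randn_like(traj)
    noised = alpha_bar.sqrt() * traj + (1 - alpha_bar).sqrt() * eps
    noisy = mask * noised + (1 - mask) * traj            # only masked tokens are noised
    context = (1 - mask) * traj                          # clean context from unmasked tokens
    pred = model(noisy, context, mask, t)
    # Denoising loss only on the masked positions.
    return (mask * (pred - eps) ** 2).sum() / mask.sum().clamp(min=1)

model = MDFDenoiser(D, N_STEPS)
betas = torch.linspace(1e-4, 0.02, N_STEPS)
traj = torch.randn(4, T, D)                              # placeholder encoded trajectories
loss = training_step(model, traj, betas)
loss.backward()
```

Because the mask is resampled per example, the same model learns many conditional queries (e.g., actions from observations, force signals from actions, or states from partial observations) rather than a single fixed mapping.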