Learning a generalist embodied agent capable of completing multiple tasks poses challenges, primarily stemming from the scarcity of action-labeled robotic datasets. In contrast, a vast amount of human videos exists, capturing intricate tasks and interactions with the physical world. This raises the promising prospect of pre-training on actionless human videos and transferring the learned knowledge to robot policy learning through limited robot demonstrations. However, doing so remains challenging due to the domain gap between humans and robots; moreover, because human videos are noisy and multimodal, it is difficult to extract from them useful information representing the dynamic world. In this paper, we introduce a novel framework that tackles these challenges by leveraging a unified discrete diffusion model to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos. We first compress both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we harness the imagined future videos to guide low-level action learning from a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning, and that the fine-tuned policies outperform previous state-of-the-art approaches. Our project website is available at https://video-diff.github.io/.
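To make the mask-and-replace diffusion strategy concrete, the sketch below shows one forward corruption step over discrete video tokens, in the style of VQ-Diffusion: each non-mask token is turned into a special [MASK] token with probability gamma, replaced by a uniformly random vocabulary token with probability beta, or kept otherwise. This is a minimal illustration under assumed per-step rates (`gamma`, `beta`) and a hypothetical `mask_id`; it is not the paper's implementation.

```python
import numpy as np

def mask_and_replace_step(tokens, mask_id, vocab_size, gamma, beta, rng):
    """One forward corruption step of a mask-and-replace discrete diffusion.

    Each token that is not already [MASK] is:
      - masked with probability gamma,
      - replaced by a uniformly random token with probability beta,
      - kept unchanged with probability 1 - gamma - beta.
    `gamma` and `beta` are hypothetical per-step rates; the actual noise
    schedule would vary with the diffusion timestep.
    """
    tokens = tokens.copy()
    u = rng.random(tokens.shape)
    not_masked = tokens != mask_id
    to_mask = not_masked & (u < gamma)
    to_replace = not_masked & (u >= gamma) & (u < gamma + beta)
    tokens[to_mask] = mask_id
    tokens[to_replace] = rng.integers(0, vocab_size, size=int(to_replace.sum()))
    return tokens
```

During pre-training, a denoising network would then be trained to recover the original (future) video tokens from such corrupted sequences; at inference, sampling starts from an all-[MASK] future and iteratively denoises it into an imagined video plan.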