Learning a generalist embodied agent that can complete multiple tasks is challenging, primarily because action-labeled robotic datasets are scarce. In contrast, vast numbers of human videos exist, capturing intricate tasks and interactions with the physical world. This makes it promising to pre-train on actionless human videos and transfer the learned knowledge to robot policy learning with only a limited number of robot demonstrations. In this paper, we introduce a novel framework that uses a unified discrete diffusion model to combine generative pre-training on human videos with policy fine-tuning on a small number of action-labeled robot videos. We first compress both human and robot videos into unified video tokens. In the pre-training stage, we employ a discrete diffusion model with a mask-and-replace diffusion strategy to predict future video tokens in the latent space. In the fine-tuning stage, we use the imagined future videos to guide low-level action learning on a limited set of robot data. Experiments demonstrate that our method generates high-fidelity future videos for planning and that the fine-tuned policies outperform previous state-of-the-art approaches, with superior generalization ability. Our project website is available at https://video-diff.github.io/.
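To make the mask-and-replace strategy mentioned above more concrete, the following is a minimal sketch of a single forward corruption step on discrete video tokens. It is an illustration under stated assumptions, not the implementation used in this work: the function name `mask_and_replace_step`, the parameters `keep_prob` and `mask_prob`, and the convention that the `[MASK]` id equals `vocab_size` are all hypothetical. Each ordinary token is kept, replaced by `[MASK]`, or resampled uniformly from the codebook, and a token that is already masked stays masked.

```python
import random


def mask_and_replace_step(tokens, vocab_size, keep_prob, mask_prob, rng):
    """Apply one simplified mask-and-replace corruption step.

    Assumptions (illustrative only): ids 0..vocab_size-1 are ordinary
    codebook tokens and id vocab_size is the special [MASK] token.
    """
    if keep_prob < 0.0 or mask_prob < 0.0 or keep_prob + mask_prob > 1.0:
        raise ValueError("keep_prob and mask_prob must be >= 0 and sum to <= 1")

    mask_id = vocab_size
    corrupted = []
    for tok in tokens:
        if tok == mask_id:
            # [MASK] is absorbing: once masked, a token stays masked.
            corrupted.append(mask_id)
            continue
        u = rng.random()  # uniform in [0, 1)
        if u < keep_prob:
            corrupted.append(tok)  # keep the token unchanged
        elif u < keep_prob + mask_prob:
            corrupted.append(mask_id)  # replace the token with [MASK]
        else:
            # Resample uniformly from the codebook (may redraw the same id).
            corrupted.append(rng.randrange(vocab_size))
    return corrupted


if __name__ == "__main__":
    rng = random.Random(0)
    video_tokens = [3, 1, 4, 1, 5, 2, 6, 5]  # toy token sequence
    print(mask_and_replace_step(video_tokens, vocab_size=8,
                                keep_prob=0.6, mask_prob=0.3, rng=rng))
```

In a full discrete diffusion model, a step like this would be applied repeatedly with a schedule that gradually increases corruption, and a network would be trained to reverse it and recover the clean future video tokens. The schedule and the denoising network are omitted here.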