We address key limitations in existing datasets and models for task-oriented hand-object interaction video generation, a critical approach for producing video demonstrations for robotic imitation learning. Current datasets, such as Ego4D, often suffer from inconsistent viewpoints and misaligned interactions, which degrade video quality and limit their applicability to precise imitation learning tasks. To this end, we introduce TASTE-Rob, a pioneering large-scale dataset of 100,856 ego-centric hand-object interaction videos. Each video is meticulously aligned with a language instruction and recorded from a consistent camera viewpoint to ensure interaction clarity. By fine-tuning a Video Diffusion Model (VDM) on TASTE-Rob, we achieve realistic object interactions, though we observe occasional inconsistencies in hand grasping posture. To enhance realism, we introduce a three-stage pose-refinement pipeline that improves hand-posture accuracy in the generated videos. Our curated dataset, coupled with the specialized pose-refinement framework, yields notable gains in generating high-quality, task-oriented hand-object interaction videos and enables superior, generalizable robotic manipulation. The TASTE-Rob dataset will be made publicly available upon publication to foster further advancements in the field.