Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. Curated datasets help, but challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. These video-based learning paradigms show promising results, providing scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyses their benefits over standard datasets, surveys metrics and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.