Scaling deep learning to massive, diverse internet data has yielded remarkably general capabilities in visual and natural language understanding and generation. In robotics, however, data remains scarce and difficult to collect, and robot learning has consequently struggled to attain similarly general capabilities. Promising Learning from Videos (LfV) methods aim to address this data bottleneck by augmenting traditional robot data with large-scale internet video data. Such video offers broad foundational information about physical behaviour and the underlying physics of the world, and can therefore be highly informative for a generalist robot. In this survey, we present a thorough overview of the emerging field of LfV. We outline its fundamental concepts, including the benefits and challenges of LfV. We then provide a comprehensive review of current methods for extracting knowledge from large-scale internet video, addressing key challenges in LfV, and boosting downstream robot and reinforcement learning with video data. The survey concludes with a critical discussion of challenges and opportunities in LfV, in which we advocate for scalable foundation model approaches that can leverage the full range of available internet video to improve the learning of robot policies and dynamics models. We hope this survey can inform and catalyse further LfV research, driving progress towards the development of general-purpose robots.