Towards Generalist Robot Learning from Internet Video: A Survey

This survey presents an overview of methods for learning from video (LfV) in the context of reinforcement learning (RL) and robotics. We focus on methods capable of scaling to large internet video datasets and, in the process, extracting foundational knowledge about the world's dynamics and physical human behaviour. Such methods hold great promise for developing general-purpose robots. We open with an overview of fundamental concepts relevant to the LfV-for-robotics setting. This includes a discussion of the exciting benefits LfV methods can offer (e.g., improved generalisation beyond the available robot data) and commentary on key LfV challenges (e.g., challenges related to missing information in video and LfV distribution shifts). Our literature review begins with an analysis of video foundation model techniques that can extract knowledge from large, heterogeneous video datasets. Next, we review methods that specifically leverage video data for robot learning. Here, we categorise work according to which RL knowledge modality benefits from the use of video data. We additionally highlight techniques for mitigating LfV challenges, including reviewing action representations that address the issue of missing action labels in video. Finally, we examine LfV datasets and benchmarks, before concluding the survey by discussing challenges and opportunities in LfV. Here, we advocate for scalable approaches that can leverage the full range of available data and that target the key benefits of LfV. Overall, we hope this survey will serve as a comprehensive reference for the emerging field of LfV, catalysing further research in the area, and ultimately facilitating progress towards obtaining general-purpose robots.
