Scalable Vision-Language-Action Model Pretraining for Robotic Manipulation with Real-Life Human Activity Videos

Qixiu Li, Yu Deng, Yaobo Liang, Lin Luo, Lei Zhou, Chengtang Yao, Lingqi Zeng, Zhiyuan Feng, Huizhi Liang, Sicheng Xu, Yizhong Zhang, Xi Chen, Hao Chen, Lily Sun, Dong Chen, Jiaolong Yang, Baining Guo

Source: arXiv. Project page: https://microsoft.github.io/VITRA/

This paper presents a novel approach for pretraining Vision-Language-Action (VLA) models for robotic manipulation using a large corpus of unscripted, real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into data formats fully aligned with existing robotic VLA training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and environment variations in real life, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate that the model's task performance scales favorably with the amount of pretraining data. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
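To make the described data format concrete, the following is a minimal sketch of what one episode record might look like, assuming only what the abstract states: each atomic segment carries a language description plus framewise 3D hand motion and camera motion. The class names, field names, and vector layouts are hypothetical and are not taken from the paper or its released code. The helper also works out the average episode length implied by the reported (rounded) dataset totals.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandFrame:
    """One frame of an episode (hypothetical layout, not the paper's schema)."""
    hand_motion: List[float]    # 3D hand motion parameters for this frame (format assumed)
    camera_motion: List[float]  # camera motion parameters for this frame (format assumed)


@dataclass
class HandEpisode:
    """One atomic-level hand activity segment with its language description."""
    description: str
    frames: List[HandFrame] = field(default_factory=list)


def average_frames_per_episode(total_frames: int, total_episodes: int) -> float:
    """Average episode length implied by dataset-level totals."""
    if total_episodes <= 0:
        raise ValueError("total_episodes must be positive")
    return total_frames / total_episodes


if __name__ == "__main__":
    # Toy episode with made-up values, purely for illustration.
    episode = HandEpisode(
        description="pick up the cup",
        frames=[HandFrame(hand_motion=[0.0, 0.0, 0.0], camera_motion=[0.0, 0.0, 0.0])],
    )
    print(len(episode.frames))
    # Reported totals are rounded: 26M frames over 1M episodes.
    print(average_frames_per_episode(26_000_000, 1_000_000))
```

Under the rounded totals in the abstract, this works out to roughly 26 frames per episode on average, which is consistent with the segments being short, atomic-level actions.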

