This paper presents a novel approach for pretraining robotic manipulation Vision-Language-Action (VLA) models on a large corpus of unscripted real-life video recordings of human hand activities. Treating the human hand as a dexterous robot end-effector, we show that "in-the-wild" egocentric human videos without any annotations can be transformed into a data format fully aligned with existing robotic VLA training data in terms of task granularity and labels. This is achieved by developing a fully automated, holistic human activity analysis approach for arbitrary human hand videos. The approach generates atomic-level hand activity segments and their language descriptions, each accompanied by framewise 3D hand motion and camera motion. We process a large volume of egocentric videos and create a hand-VLA training dataset containing 1M episodes and 26M frames. This training data covers a wide range of objects and concepts, dexterous manipulation tasks, and real-life environment variations, vastly exceeding the coverage of existing robot data. We design a dexterous hand VLA model architecture and pretrain the model on this dataset. The model exhibits strong zero-shot capabilities on completely unseen real-world observations. Additionally, fine-tuning it on a small amount of real robot action data significantly improves task success rates and generalization to novel objects in real robotic experiments. We also demonstrate the favorable scaling behavior of the model's task performance with respect to pretraining data scale. We believe this work lays a solid foundation for scalable VLA pretraining, advancing robots toward truly generalizable embodied intelligence.
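To make the described data format concrete, the following is a minimal sketch of what one episode record in such a hand-VLA dataset might contain, based only on the abstract's description (an atomic activity segment with a language label plus framewise 3D hand motion and camera motion). All field names, array shapes, and the 51-D MANO-style hand parameterization are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical episode record for a hand-VLA training dataset.
# Field names and shapes are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class HandVLAEpisode:
    instruction: str          # language description of the atomic hand activity
    rgb_frames: np.ndarray    # (T, H, W, 3) egocentric video frames
    hand_poses: np.ndarray    # (T, D) framewise 3D hand motion parameters
    camera_poses: np.ndarray  # (T, 4, 4) framewise camera extrinsics

    def __post_init__(self) -> None:
        # All framewise streams must share the same segment length T.
        t = self.rgb_frames.shape[0]
        assert self.hand_poses.shape[0] == t, "hand poses misaligned with frames"
        assert self.camera_poses.shape[0] == t, "camera poses misaligned with frames"

# Example: a 10-frame episode; the 51-D hand pose here is an assumed
# MANO-style parameterization, not the paper's confirmed representation.
episode = HandVLAEpisode(
    instruction="pick up the mug from the table",
    rgb_frames=np.zeros((10, 224, 224, 3), dtype=np.uint8),
    hand_poses=np.zeros((10, 51), dtype=np.float32),
    camera_poses=np.tile(np.eye(4, dtype=np.float32), (10, 1, 1)),
)
print(episode.instruction, episode.rgb_frames.shape)
```

Keeping each record at this atomic granularity is what aligns the converted human video data with how existing robot VLA episodes pair one instruction with one short action trajectory.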