Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework for pretraining imitation learning models in a self-supervised way by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or from videos of humans performing actions with everyday objects. Our framework transfers learned knowledge across tasks, environments, and embodiments. It outperforms models pretrained with ground-truth robot actions, as well as other similar pretraining methods, on the LIBERO benchmark and in a real-world setup, while remaining efficient and practical for real-world deployment.