Despite progress, Vision-Language-Action models (VLAs) remain limited by the scarcity of large-scale, diverse robot data. While human manipulation videos offer a rich alternative, existing methods are forced to choose between small, precisely labeled datasets and vast in-the-wild footage with unreliable hand-tracking labels. We present JALA, a pretraining framework that learns Jointly-Aligned Latent Actions. JALA bypasses full visual dynamics reconstruction and instead learns a predictive action embedding aligned with both inverse dynamics and real actions. This yields a transition-aware, behavior-centric latent space for learning from heterogeneous human data. We scale this approach with UniHand-Mix, a corpus of 7.5M video clips (>2,000 hours) blending laboratory and in-the-wild footage. Experiments show that JALA generates more realistic hand motions in both controlled and unconstrained scenarios, and significantly improves downstream robot manipulation performance in both simulation and real-world tasks. These results indicate that jointly-aligned latent actions offer a scalable pathway for VLA pretraining from human data.
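To make the joint-alignment objective concrete, here is a minimal sketch in PyTorch of how a latent action could be trained against a feature-space transition prediction, an inverse-dynamics estimate, and real actions when labels are available. The abstract does not specify an architecture, so every module, dimension, and loss weighting below is an illustrative assumption, not JALA's actual implementation.

```python
# Hypothetical sketch of a jointly-aligned latent action objective.
# All module names, dimensions, and loss terms are assumptions for
# illustration; the paper's real architecture may differ substantially.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointlyAlignedLatentAction(nn.Module):
    def __init__(self, feat_dim=512, latent_dim=64, action_dim=24):
        super().__init__()
        # Encodes a (frame_t, frame_t+1) feature pair into a latent action.
        self.action_encoder = nn.Linear(2 * feat_dim, latent_dim)
        # Predictive head: predicts next-frame features from (frame_t, latent),
        # sidestepping full pixel-level dynamics reconstruction.
        self.predictor = nn.Linear(feat_dim + latent_dim, feat_dim)
        # Inverse-dynamics head operating on the same frame features.
        self.inverse_dynamics = nn.Linear(2 * feat_dim, latent_dim)
        # Maps latents to the real action space for labeled (lab) clips.
        self.action_head = nn.Linear(latent_dim, action_dim)

    def forward(self, f_t, f_t1, real_action=None):
        pair = torch.cat([f_t, f_t1], dim=-1)
        z = self.action_encoder(pair)  # latent action for this transition
        # (1) Predictive alignment: the latent must explain the transition.
        f_t1_hat = self.predictor(torch.cat([f_t, z], dim=-1))
        loss_pred = F.mse_loss(f_t1_hat, f_t1.detach())
        # (2) Inverse-dynamics alignment: the latent agrees with an ID estimate.
        z_id = self.inverse_dynamics(pair)
        loss_id = F.mse_loss(z, z_id.detach())
        # (3) Real-action alignment, applied only when labels exist, so
        #     unlabeled in-the-wild clips still contribute via (1) and (2).
        loss_act = torch.tensor(0.0, device=f_t.device)
        if real_action is not None:
            loss_act = F.mse_loss(self.action_head(z), real_action)
        return loss_pred + loss_id + loss_act
```

Under this reading, heterogeneous data mixes naturally: in-the-wild clips without reliable labels supply only the predictive and inverse-dynamics terms, while lab clips additionally anchor the latent space to real actions.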