While the shortage of explicit action data limits Vision-Language-Action (VLA) models, human action videos offer a scalable but unlabeled data source. A critical challenge in using large-scale human video datasets lies in transforming visual signals into ontology-independent representations, known as latent actions. However, the capacity of latent action representations to derive robust control from visual observations has yet to be rigorously evaluated. We introduce the Latent Action Representation Yielding (LARY) Benchmark, a unified framework for evaluating latent action representations on both high-level semantic actions (what to do) and low-level robotic control (how to do it). The curated dataset encompasses over one million videos (1,000 hours) spanning 151 action categories, alongside 620K image pairs and 595K motion trajectories across diverse embodiments and environments. Our experiments reveal two crucial insights: (i) general visual foundation models, trained without any action supervision, consistently outperform specialized embodied latent action models; and (ii) latent-based visual space is fundamentally better aligned with physical action space than pixel-based space. These results suggest that general visual representations inherently encode action-relevant knowledge for physical control, and that semantic-level abstraction is a fundamentally more effective pathway from vision to action than pixel-level reconstruction.