Learning \emph{latent actions} from diverse human videos enables scaling robot learning beyond embodiment-specific robot datasets, and these latent actions have recently been used as pseudo-action labels for vision-language-action (VLA) model pretraining. To make VLA pretraining effective, latent actions should contain information about the underlying agent's actions despite the absence of ground-truth labels. We propose \textbf{M}ulti-\textbf{V}iew\textbf{P}oint \textbf{L}atent \textbf{A}ction \textbf{M}odel (\textbf{MVP-LAM}), which learns discrete latent actions that are highly informative about ground-truth actions from time-synchronized multi-view videos. MVP-LAM trains latent actions with a \emph{cross-viewpoint reconstruction} objective, so that a latent action inferred from one view must explain the future in another view, reducing reliance on viewpoint-specific cues. On Bridge V2, MVP-LAM produces more action-centric latent actions, achieving higher mutual information with ground-truth actions and improved action prediction, including under out-of-distribution evaluation. Finally, pretraining VLAs with MVP-LAM latent actions improves downstream manipulation performance on the SIMPLER and LIBERO-Long benchmarks.
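The cross-viewpoint reconstruction objective can be sketched in a toy form: a latent action inferred from a frame pair in one view is quantized against a discrete codebook and then asked to predict the future frame in a different, time-synchronized view. The sketch below is a minimal illustration, not the paper's architecture; all dimensions, the linear encoder/decoder, and the nearest-neighbor quantization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): flattened frame features,
# a small codebook of discrete latent actions, and the latent dimension.
D, K, Z = 16, 8, 4

codebook = rng.normal(size=(K, Z))           # K discrete latent actions
W_enc = rng.normal(size=(2 * D, Z)) * 0.1    # encoder: (o_t^A, o_{t+1}^A) -> z
W_dec = rng.normal(size=(D + Z, D)) * 0.1    # decoder: (o_t^B, z) -> o_{t+1}^B

def infer_latent(obs_t_A, obs_t1_A):
    """Encode a frame pair from view A, quantize to the nearest code (VQ-style)."""
    z_cont = np.concatenate([obs_t_A, obs_t1_A]) @ W_enc
    idx = int(np.argmin(((codebook - z_cont) ** 2).sum(axis=1)))
    return codebook[idx], idx

def cross_view_recon_loss(obs_t_A, obs_t1_A, obs_t_B, obs_t1_B):
    """The latent inferred from view A must explain the future in view B,
    so viewpoint-specific cues in view A cannot carry the prediction."""
    z, _ = infer_latent(obs_t_A, obs_t1_A)
    pred_t1_B = np.concatenate([obs_t_B, z]) @ W_dec
    return float(((pred_t1_B - obs_t1_B) ** 2).mean())

# Toy time-synchronized observations from two viewpoints.
o_tA, o_t1A = rng.normal(size=D), rng.normal(size=D)
o_tB, o_t1B = rng.normal(size=D), rng.normal(size=D)
loss = cross_view_recon_loss(o_tA, o_t1A, o_tB, o_t1B)
```

Minimizing this loss over many synchronized pairs is what pressures the discrete latent toward action-centric (viewpoint-invariant) information, since a code that memorized view-A pixels would not help reconstruct view B.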