Latent Action Models (LAMs) have rapidly gained traction as an important component in the pre-training pipelines of leading Vision-Language-Action models. However, they fail when observations contain action-correlated distractors, often encoding noise instead of meaningful latent actions. Humans, on the other hand, can effortlessly distinguish task-relevant motions from irrelevant details in any video given only a brief task description. In this work, we propose to utilize the common-sense reasoning abilities of Vision-Language Models (VLMs) to provide promptable representations, effectively separating controllable changes from noise in an unsupervised way. We use these representations as targets during LAM training and benchmark a wide variety of popular VLMs, revealing substantial variation in the quality of promptable representations as well as in their robustness to different prompts and hyperparameters. Interestingly, we find that more recent VLMs may perform worse than older ones. Finally, we show that simply asking VLMs to ignore distractors can substantially improve latent action quality, yielding up to a six-fold increase in downstream success rates on Distracting MetaWorld.