Agents capable of reasoning and planning in the real world require the ability to predict the consequences of their actions. While world models possess this capability, they most often require action labels, which can be complex to obtain at scale. This motivates latent action models, which learn an action space from videos alone. Our work addresses the problem of learning latent action world models on in-the-wild videos, expanding the scope of existing work that focuses on simple robotics simulations, video games, or manipulation data. While this allows us to capture richer actions, it also introduces challenges stemming from video diversity, such as environmental noise or the lack of a common embodiment across videos. To address some of these challenges, we discuss properties that actions should satisfy, as well as relevant architectural choices and evaluations. We find that continuous, but constrained, latent actions can capture the complexity of actions in in-the-wild videos, something that the commonly used vector quantization cannot. For example, we find that changes in the environment caused by agents, such as humans entering the room, can be transferred across videos. This highlights the capability of learning actions that are specific to in-the-wild videos. In the absence of a common embodiment across videos, we mainly learn latent actions that are localized in space, relative to the camera. Nonetheless, we are able to train a controller that maps known actions to latent ones, allowing us to use latent actions as a universal interface and solve planning tasks with our world model at a performance similar to action-conditioned baselines. Our analyses and experiments provide a step toward scaling latent action models to the real world.