Imitation learning has proven to be a powerful tool for training complex visuomotor policies. However, current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations. A key reason for this poor data efficiency is that visual representations are predominantly either pretrained on out-of-domain data or trained directly through a behavior cloning objective. In this work, we present DynaMo, a new in-domain, self-supervised method for learning visual representations. Given a set of expert demonstrations, we jointly learn a latent inverse dynamics model and a forward dynamics model over a sequence of image embeddings, predicting the next frame in latent space, without augmentations, contrastive sampling, or access to ground-truth actions. Importantly, DynaMo does not require any out-of-domain data such as Internet datasets or cross-embodiment datasets. On a suite of six simulated and real environments, we show that representations learned with DynaMo significantly improve downstream imitation learning performance over prior self-supervised learning objectives and pretrained representations. Gains from using DynaMo hold across policy classes such as Behavior Transformer, Diffusion Policy, MLP, and nearest neighbors. Finally, we ablate the key components of DynaMo and measure their impact on downstream policy performance. Robot videos are best viewed at https://dynamo-ssl.github.io
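The joint objective described above can be sketched in a few lines: embed each frame, infer a latent action from consecutive embeddings with an inverse dynamics model, and train a forward dynamics model to predict the next embedding from the current embedding and that latent action. The toy dimensions, the use of plain linear maps, and the finite-difference optimizer below are illustrative assumptions, not the paper's implementation; a real implementation would also need a mechanism to prevent the encoder from collapsing to a trivial representation, which this sketch omits.

```python
# Minimal sketch of a DynaMo-style objective: jointly fit an encoder,
# a latent inverse dynamics model, and a forward dynamics model by
# predicting the next frame in latent space. No ground-truth actions,
# augmentations, or contrastive sampling are used.
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, EMB_DIM, ACT_DIM = 16, 6, 3  # assumed toy dimensions

# Toy "expert demonstration": a sequence of observations, no action labels.
obs = rng.normal(size=(12, OBS_DIM))

# Linear stand-ins for the three jointly trained modules.
params = {
    "enc": rng.normal(scale=0.1, size=(OBS_DIM, EMB_DIM)),          # o_t -> z_t
    "inv": rng.normal(scale=0.1, size=(2 * EMB_DIM, ACT_DIM)),      # (z_t, z_{t+1}) -> a_t
    "fwd": rng.normal(scale=0.1, size=(EMB_DIM + ACT_DIM, EMB_DIM)) # (z_t, a_t) -> z_{t+1}
}

def loss(p):
    z = obs @ p["enc"]                                      # embed every frame
    a = np.concatenate([z[:-1], z[1:]], axis=1) @ p["inv"]  # infer latent actions
    z_pred = np.concatenate([z[:-1], a], axis=1) @ p["fwd"] # predict next embedding
    return np.mean((z_pred - z[1:]) ** 2)                   # next-frame error in latent space

def train(p, lr=0.1, steps=60, eps=1e-5):
    # Finite-difference gradient descent on the joint objective,
    # so the sketch stays dependency-free.
    for _ in range(steps):
        for W in p.values():
            g = np.zeros_like(W)
            it = np.nditer(W, flags=["multi_index"])
            for _ in it:
                i = it.multi_index
                old = W[i]
                W[i] = old + eps; up = loss(p)
                W[i] = old - eps; down = loss(p)
                W[i] = old
                g[i] = (up - down) / (2 * eps)
            W -= lr * g
    return p

before = loss(params)
train(params)
after = loss(params)
print(before, after)  # the joint prediction error decreases
```

All gradients flow through the encoder as well as both dynamics models, which is what makes the embedding itself, not just the dynamics heads, adapt to the prediction task.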