Leveraging vast amounts of unlabeled internet video data for embodied AI is currently bottlenecked by the lack of action labels and the presence of action-correlated visual distractors. Although recent latent action policy optimization (LAPO) has shown promise in inferring proxy-action labels from visual observations, its performance degrades significantly when distractors are present. To address this limitation, we propose a novel object-centric latent action learning framework that centers on objects rather than pixels. We leverage self-supervised object-centric pretraining to disentangle the movement of the agent from distracting background dynamics. This allows LAPO to focus on task-relevant interactions, yielding more robust proxy-action labels that enable better imitation learning and efficient adaptation of the agent with just a few action-labeled trajectories. We evaluated our method on eight visually complex tasks across the Distracting Control Suite (DCS) and Distracting MetaWorld (DMW). Our results show that object-centric pretraining mitigates the negative effects of distractors by 50%, as measured by downstream task performance: average return (DCS) and success rate (DMW).
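The pipeline above can be sketched at a high level. The following is a minimal illustration, assuming the usual LAPO structure of an inverse dynamics model (IDM) that infers a latent proxy action from consecutive observations and a forward dynamics model (FDM) trained to reconstruct the next observation from it, here operating on object slots from a frozen object-centric encoder. All names, dimensions, and the linear stand-ins for learned networks are illustrative, not from the paper:

```python
import numpy as np

# Illustrative dimensions -- not from the paper.
N_SLOTS, SLOT_DIM, ACT_DIM, FEAT_DIM = 4, 8, 2, 16
rng = np.random.default_rng(0)

# Stand-in for a frozen self-supervised object-centric encoder:
# maps a flat frame-feature vector to a set of per-object slots,
# so agent motion and background dynamics land in separate slots.
W_enc = rng.standard_normal((FEAT_DIM, N_SLOTS * SLOT_DIM)) * 0.1

def encode(frame):
    return (frame @ W_enc).reshape(N_SLOTS, SLOT_DIM)

# Inverse dynamics model (IDM): infers a latent "proxy action"
# from consecutive slot sets.
W_idm = rng.standard_normal((2 * N_SLOTS * SLOT_DIM, ACT_DIM)) * 0.1

def idm(slots_t, slots_t1):
    x = np.concatenate([slots_t.ravel(), slots_t1.ravel()])
    return np.tanh(x @ W_idm)

# Forward dynamics model (FDM): predicts the next slots from the
# current slots and the latent action; LAPO-style training would
# minimize its reconstruction error.
W_fdm = rng.standard_normal((N_SLOTS * SLOT_DIM + ACT_DIM,
                             N_SLOTS * SLOT_DIM)) * 0.1

def fdm(slots_t, latent_action):
    x = np.concatenate([slots_t.ravel(), latent_action])
    return (x @ W_fdm).reshape(N_SLOTS, SLOT_DIM)

# One unlabeled video transition -> one proxy-action label.
frame_t = rng.standard_normal(FEAT_DIM)
frame_t1 = rng.standard_normal(FEAT_DIM)
s_t, s_t1 = encode(frame_t), encode(frame_t1)
a_latent = idm(s_t, s_t1)                              # proxy action
recon_err = np.mean((fdm(s_t, a_latent) - s_t1) ** 2)  # FDM loss term
```

Because the encoder groups pixels into object slots before the IDM ever sees them, the latent action is inferred from object dynamics rather than raw pixels, which is what lets task-irrelevant background motion be ignored. The resulting proxy-action labels can then supervise imitation learning, with a small set of truly action-labeled trajectories used to map latent actions to real ones.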