Pre-trained visual representations have significantly advanced imitation learning, but because they remain frozen during policy learning, they are often task-agnostic. In this work, we explore leveraging pre-trained text-to-image diffusion models to obtain task-adaptive visual representations for robotic control, without fine-tuning the model itself. However, we find that naively applying textual conditions, a successful strategy in other vision domains, yields minimal or even negative gains in control tasks. We attribute this to the domain gap between the diffusion model's training data and robotic control environments, which leads us to argue for conditions that account for the specific, dynamic visual information required for control. To this end, we propose ORCA, which introduces learnable task prompts that adapt to the control environment and visual prompts that capture fine-grained, frame-specific details. By facilitating task-adaptive representations with these newly devised conditions, our approach achieves state-of-the-art performance on various robotic control benchmarks, significantly surpassing prior methods.