Successfully addressing a wide variety of tasks is a core ability of autonomous agents. It requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, the perception modules as well. An analogy is the human visual system, which uses top-down signals to focus attention according to the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. For the latter case, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, these tasks can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.
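To make the idea of inferring a task embedding from demonstrations concrete, the following is a minimal toy sketch, not the method of this work. It assumes a FiLM-style scale-and-shift adapter on top of frozen features, a linear policy head, and plain gradient descent on a behavior-cloning loss with all weights frozen, so that only the embedding is optimized. All names (`W`, `G`, `B`, `P`, `policy`, `estimate_embedding`), the dimensions, and the step-size rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, FEAT, EMB, ACT = 8, 16, 4, 3  # hypothetical toy dimensions

# Frozen weights: stand-ins for a pre-trained backbone, adapter, and policy head.
W = rng.normal(size=(FEAT, OBS)) / np.sqrt(OBS)    # "backbone"
G = rng.normal(size=(FEAT, EMB)) / np.sqrt(EMB)    # adapter scale projection
B = rng.normal(size=(FEAT, EMB)) / np.sqrt(EMB)    # adapter shift projection
P = rng.normal(size=(ACT, FEAT)) / np.sqrt(FEAT)   # policy head


def policy(obs, e):
    """Predict actions for a batch of observations under task embedding e."""
    h = np.tanh(obs @ W.T)                  # frozen features, shape (N, FEAT)
    h_adapted = h * (1.0 + G @ e) + B @ e   # task-conditioned scale and shift
    return h_adapted @ P.T                  # actions, shape (N, ACT)


def bc_loss(obs, actions, e):
    """Mean squared behavior-cloning error over the demonstration set."""
    return float(np.mean(np.sum((policy(obs, e) - actions) ** 2, axis=1)))


def estimate_embedding(obs, actions, steps=5000):
    """Gradient descent on the embedding only; every weight stays frozen."""
    h = np.tanh(obs @ W.T)
    # Jacobian of the action w.r.t. e for each sample, shape (N, ACT, EMB).
    J = np.einsum("af,nfe->nae", P, h[:, :, None] * G[None] + B[None])
    # The toy loss is quadratic in e, so 1 / (largest Hessian eigenvalue)
    # is a safe step size. This shortcut is specific to the toy model.
    hessian = 2.0 * np.einsum("nae,naf->ef", J, J) / len(obs)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    e = np.zeros(EMB)
    for _ in range(steps):
        residual = policy(obs, e) - actions  # shape (N, ACT)
        grad = 2.0 * np.einsum("nae,na->e", J, residual) / len(obs)
        e = e - lr * grad
    return e


# Demonstrations from an "unseen" task with a hidden ground-truth embedding.
e_true = rng.normal(size=EMB)
demo_obs = rng.normal(size=(64, OBS))
demo_actions = policy(demo_obs, e_true)

e_hat = estimate_embedding(demo_obs, demo_actions)
print("loss before:", bc_loss(demo_obs, demo_actions, np.zeros(EMB)))
print("loss after: ", bc_loss(demo_obs, demo_actions, e_hat))
```

In this toy setting the demonstrations are noise-free and the model is linear in the embedding, so the estimate can recover the hidden embedding closely. With a real vision backbone and learned adapters the objective would generally be non-convex, and an automatic-differentiation framework would replace the hand-written gradient.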