Successfully addressing a wide variety of tasks is a core ability of autonomous agents. It requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also the perception modules. An analogy is the human visual system, which uses top-down signals to direct attention according to the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require fine-tuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference time if the task is known, or otherwise inferred from a set of example demonstrations. For the latter case, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, these tasks can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.
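To make the idea of inferring a task embedding from demonstrations more concrete, the following is a minimal sketch and not the architecture or estimator of this work. It assumes a FiLM-style adapter (a per-feature scale and shift computed from the task embedding), small random linear maps as stand-ins for the frozen backbone and policy, and plain gradient descent with step halving on the behavior cloning loss. All names (`params`, `policy`, `bc_loss`, `estimate_task_embedding`) and all dimensions are hypothetical. Only the embedding is optimized; every other weight stays frozen.

```python
import numpy as np

rng = np.random.default_rng(0)
D_OBS, D_FEAT, D_TASK, D_ACT = 8, 16, 4, 2  # illustrative sizes (assumption)

# Frozen stand-ins for the pre-trained backbone, the adapter, and the policy.
params = {
    "W_backbone": rng.normal(size=(D_FEAT, D_OBS)) / np.sqrt(D_OBS),
    "A_scale": 0.5 * rng.normal(size=(D_FEAT, D_TASK)),
    "A_shift": 0.5 * rng.normal(size=(D_FEAT, D_TASK)),
    "W_policy": rng.normal(size=(D_ACT, D_FEAT)) / np.sqrt(D_FEAT),
}

def backbone(obs, p):
    """Frozen feature extractor: (N, D_OBS) -> (N, D_FEAT)."""
    return np.tanh(obs @ p["W_backbone"].T)

def policy(obs, e, p):
    """Task-conditioned features followed by a linear policy head."""
    h = backbone(obs, p)
    scale = 1.0 + p["A_scale"] @ e  # per-feature scale from the task embedding
    shift = p["A_shift"] @ e        # per-feature shift from the task embedding
    return (h * scale + shift) @ p["W_policy"].T

def bc_loss(e, obs, actions, p):
    """Behavior cloning loss: mean squared action error over the demonstrations."""
    return float(np.mean(np.sum((policy(obs, e, p) - actions) ** 2, axis=1)))

def bc_grad(e, obs, actions, p):
    """Analytic gradient of bc_loss with respect to the task embedding only."""
    h = backbone(obs, p)
    residual = policy(obs, e, p) - actions
    g = residual @ p["W_policy"]  # error back-propagated to the adapted features
    grad = p["A_scale"].T @ (g * h).sum(axis=0) + p["A_shift"].T @ g.sum(axis=0)
    return 2.0 * grad / len(obs)

def estimate_task_embedding(obs, actions, p, steps=200, lr=0.1):
    """Gradient descent on the embedding; the step is halved when it does not help."""
    e = np.zeros(D_TASK)
    loss = bc_loss(e, obs, actions, p)
    for _ in range(steps):
        candidate = e - lr * bc_grad(e, obs, actions, p)
        candidate_loss = bc_loss(candidate, obs, actions, p)
        if candidate_loss < loss:
            e, loss = candidate, candidate_loss
        else:
            lr *= 0.5
    return e, loss

# A few synthetic demonstrations generated from an embedding the estimator never sees.
demo_obs = rng.normal(size=(20, D_OBS))
e_true = rng.normal(size=D_TASK)
demo_actions = policy(demo_obs, e_true, params)

initial_loss = bc_loss(np.zeros(D_TASK), demo_obs, demo_actions, params)
e_hat, final_loss = estimate_task_embedding(demo_obs, demo_actions, params)
```

In this toy setting the actions are affine in the embedding, so the loss is a convex quadratic and a closed-form least-squares solution would also work. Gradient descent is used here only because it carries over to non-linear adapters, where no closed form is available.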