Successfully addressing a wide variety of tasks is a core ability of autonomous agents. It requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the underlying perception modules. An analogy is the human visual system, which uses top-down signals to direct attention according to the current task. Similarly, in this work, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights, combined with a single policy that is trained with behavior cloning and is capable of addressing multiple tasks. We condition the policy and the visual adapters on task embeddings, which can be selected at inference time if the task is known, or otherwise inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, they can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given visual demonstrations.
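To make the idea of inferring a task embedding from demonstrations concrete, the following is a minimal sketch and not the method itself. It assumes a FiLM-style scale-and-shift adapter, a linear policy head, demonstrations given as (observation, action) pairs, and plain gradient descent on a behavior-cloning loss; none of these choices is specified in the abstract. All names (`backbone`, `adapt`, `policy`, `estimate_embedding`) and all dimensions are hypothetical, and random matrices stand in for pre-trained weights. Every weight stays frozen and only the embedding is optimized.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS_DIM, FEAT_DIM, EMB_DIM, ACT_DIM = 8, 16, 3, 4  # illustrative sizes only

# Frozen stand-ins (assumptions): backbone, adapter, and policy weights.
W_backbone = rng.normal(size=(FEAT_DIM, OBS_DIM)) / np.sqrt(OBS_DIM)
A_scale = rng.normal(size=(FEAT_DIM, EMB_DIM)) / np.sqrt(EMB_DIM)
A_shift = rng.normal(size=(FEAT_DIM, EMB_DIM)) / np.sqrt(EMB_DIM)
W_policy = rng.normal(size=(ACT_DIM, FEAT_DIM)) / np.sqrt(FEAT_DIM)


def backbone(obs):
    """Frozen feature extractor: (N, OBS_DIM) -> (N, FEAT_DIM)."""
    return np.tanh(obs @ W_backbone.T)


def adapt(features, emb):
    """FiLM-style adapter (assumed form): scale and shift features by emb."""
    return features * (1.0 + A_scale @ emb) + A_shift @ emb


def policy(obs, emb):
    """Task-conditioned policy: (N, OBS_DIM) -> (N, ACT_DIM)."""
    return adapt(backbone(obs), emb) @ W_policy.T


def bc_loss(emb, obs, actions):
    """Behavior-cloning loss: mean squared action error over the demos."""
    residual = policy(obs, emb) - actions
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def jacobian(obs):
    """d(action)/d(emb) per demo step, shape (N, ACT_DIM, EMB_DIM)."""
    feats = backbone(obs)
    scale_part = np.einsum("af,nf,fk->nak", W_policy, feats, A_scale)
    return scale_part + W_policy @ A_shift


def estimate_embedding(obs, actions, steps=500):
    """Gradient descent on the embedding only; all weights stay frozen."""
    n = len(obs)
    J = jacobian(obs)  # constant here because this toy model is linear in emb
    hessian = 2.0 / n * np.einsum("nak,naj->kj", J, J)
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()  # step size 1/L
    emb = np.zeros(EMB_DIM)
    for _ in range(steps):
        residual = policy(obs, emb) - actions
        grad = 2.0 / n * np.einsum("nak,na->k", J, residual)
        emb = emb - lr * grad
    return emb


# Synthetic "unseen task": demos generated from a hidden embedding.
true_emb = rng.normal(size=EMB_DIM)
demo_obs = rng.normal(size=(64, OBS_DIM))
demo_actions = policy(demo_obs, true_emb)

est_emb = estimate_embedding(demo_obs, demo_actions)
loss_before = bc_loss(np.zeros(EMB_DIM), demo_obs, demo_actions)
loss_after = bc_loss(est_emb, demo_obs, demo_actions)
print(f"BC loss before: {loss_before:.4f}, after: {loss_after:.6f}")
```

In this toy setting the loss is quadratic in the embedding, so the gradient is available in closed form and a step size of 1/L guarantees a monotone decrease. With a deep adapter and policy, the same loop would presumably rely on automatic differentiation and a standard optimizer instead.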