Successfully addressing a wide variety of tasks is a core ability of autonomous agents. It requires flexibly adapting the underlying decision-making strategies and, as we argue in this work, also adapting the perception modules. An analogy is the human visual system, which uses top-down signals to focus attention according to the current task. Similarly, we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require fine-tuning any pre-trained weights, combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings, which can be selected at inference if the task is known, or alternatively inferred from a set of example demonstrations. To this end, we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that, in contrast to existing work, these tasks can be addressed with a single policy. In particular, we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations.
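To illustrate the idea of inferring a task embedding from demonstrations by optimization, the following is a minimal toy sketch, not the method evaluated in this work. The abstract does not specify the adapter architecture or the estimator, so everything here is an assumption made for illustration: the "adapter" is a simple per-feature scale `(1 + e_j)`, the frozen policy is a linear map with hypothetical weights `W`, and the embedding is fitted by plain gradient descent on a squared behavior-cloning error. All names (`adapted_policy`, `estimate_task_embedding`, `W`, `TRUE_EMB`, `DEMOS`) are hypothetical.

```python
# Toy sketch (assumption): frozen linear policy + per-feature scaling adapter.
# Only the task embedding is optimized; the weights W are never updated.

W = [1.0, -0.5]          # hypothetical frozen policy weights
TRUE_EMB = [0.3, -0.2]   # hypothetical embedding of an "unseen" task


def adapted_policy(x, task_emb, w):
    """Action predicted from features x, each scaled by (1 + e_j) before the frozen policy."""
    return sum(wj * xj * (1.0 + ej) for wj, xj, ej in zip(w, x, task_emb))


def estimate_task_embedding(demos, w, dim, lr=0.2, steps=500):
    """Fit a task embedding to (features, action) demos by gradient descent.

    Minimizes the mean squared behavior-cloning error with respect to the
    embedding only, starting from the zero embedding.
    """
    emb = [0.0] * dim
    n = len(demos)
    for _ in range(steps):
        grad = [0.0] * dim
        for x, action in demos:
            err = adapted_policy(x, emb, w) - action
            for j in range(dim):
                # d/de_j of err^2 is 2 * err * w_j * x_j, averaged over demos
                grad[j] += 2.0 * err * w[j] * x[j] / n
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# A few demonstrations generated with the (normally unknown) true embedding.
FEATURES = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
DEMOS = [(x, adapted_policy(x, TRUE_EMB, W)) for x in FEATURES]

if __name__ == "__main__":
    estimated = estimate_task_embedding(DEMOS, W, dim=2)
    print("estimated embedding:", estimated)
```

In this toy setting the embedding is identifiable because the demonstrations cover both feature dimensions; with a real vision backbone and a nonlinear policy, the same loop would typically rely on automatic differentiation, and recovery of a unique embedding is not guaranteed.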