This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predict their performance on each multi-modal task, such as image recognition, referring, captioning, visual question answering, and text question answering, without fine-tuning them. A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational cost. Although recent approaches use lightweight metrics to measure models' transferability, they often depend heavily on prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform the diverse label formats of different downstream tasks, such as categories, texts, and bounding boxes, into a unified noisy label embedding. EMMS estimates a model's transferability through a simple weighted linear regression, which can be solved efficiently by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% performance gains on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedups in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.
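To make the "weighted linear regression solved by alternating minimization" idea concrete, the following is a minimal sketch and not the released EMMS implementation. It assumes one plausible form of the objective, which the abstract does not spell out: model features `X` are regressed onto a convex combination of `K` label embeddings `Z[k]` (one per foundation model), with the combination weights `t` constrained to the probability simplex. The function names, the simplex constraint, and the projected-gradient update for `t` are all illustrative assumptions.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {t : t >= 0, sum(t) = 1}."""
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def alternating_weighted_regression(X, Z, n_iters=20, inner_steps=50):
    """Hypothetical sketch of an alternating solver.

    X: (N, d) features from one pre-trained model.
    Z: (K, N, D) label embeddings from K sources (assumed layout).
    Returns W (d, D), t (K,), and the mean squared residual per outer iteration.
    """
    K = Z.shape[0]
    t = np.full(K, 1.0 / K)                   # start from uniform weights
    A = Z.reshape(K, -1).T                    # (N*D, K): one column per source
    G = A.T @ A                               # Gram matrix of the sources
    # Step size 1/L, where L = 2 * lambda_max(G) is the gradient's Lipschitz constant.
    step = 1.0 / (2.0 * np.linalg.eigvalsh(G)[-1] + 1e-12)
    history = []
    for _ in range(n_iters):
        # Fix t: ordinary least squares for W against the blended target.
        target = np.tensordot(t, Z, axes=1)   # (N, D)
        W, *_ = np.linalg.lstsq(X, target, rcond=None)
        # Fix W: projected gradient descent on t over the simplex.
        b = A.T @ (X @ W).ravel()
        for _ in range(inner_steps):
            grad = 2.0 * (G @ t - b)
            t = project_to_simplex(t - step * grad)
        resid = X @ W - np.tensordot(t, Z, axes=1)
        history.append(float(np.mean(resid ** 2)))
    return W, t, history

# Synthetic usage: three noisy views of the same linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y = X @ rng.normal(size=(8, 4))
Z = np.stack([Y + s * rng.normal(size=Y.shape) for s in (0.1, 1.0, 1.0)])
W, t, history = alternating_weighted_regression(X, Z)
print("source weights:", np.round(t, 3))
print("final mean squared residual:", history[-1])
```

In this sketch each half-step cannot increase the objective: the `W` update is an exact least-squares solve, and projected gradient with step `1/L` is a descent method on a convex quadratic. Under the stated assumptions, a smaller final residual would be read as a sign that the model's features fit the unified label embedding better; how EMMS turns the fit into a transferability score is defined in the paper and its code, not here.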