Foundation Model is Efficient Multimodal Multitask Model Selector

This paper investigates an under-explored but important problem: given a collection of pre-trained neural networks, predict their performance on each multi-modal task, such as image recognition, referring, captioning, visual question answering, and text question answering, without fine-tuning them. A brute-force approach is to fine-tune all models on all target datasets, which incurs high computational cost. Although recent approaches employ lightweight metrics to measure models' transferability, they often depend heavily on prior knowledge of a single task, making them inapplicable in a multi-modal multi-task scenario. To tackle this issue, we propose an efficient multi-task model selector (EMMS), which employs large-scale foundation models to transform the diverse label formats of different downstream tasks, such as categories, texts, and bounding boxes, into a unified noisy label embedding. EMMS estimates a model's transferability through a simple weighted linear regression, which can be solved efficiently by an alternating minimization algorithm with a convergence guarantee. Extensive experiments on 5 downstream tasks with 24 datasets show that EMMS is fast, effective, and generic enough to assess the transferability of pre-trained models, making it the first model selection method in the multi-task scenario. For instance, compared with the state-of-the-art method LogME enhanced by our label embeddings, EMMS achieves performance gains of 9.0%, 26.3%, 20.1%, 54.8%, and 12.2% on image recognition, referring, captioning, visual question answering, and text question answering, while bringing 5.13x, 6.29x, 3.59x, 6.19x, and 5.66x speedups in wall-clock time, respectively. The code is available at https://github.com/OpenGVLab/Multitask-Model-Selector.
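The abstract does not spell out the regression objective, so the following is only a minimal sketch of what a weighted linear regression solved by alternating minimization could look like, not the authors' implementation. It assumes that each of K foundation models yields a one-dimensional label embedding per sample (the columns of a hypothetical matrix `Z`), that the regression target is a convex combination `Z @ t` with `t` on the probability simplex, and that `X` holds the features of one candidate pre-trained model. The function names, the simplex constraint, and the projected-gradient update for `t` are all assumptions made for illustration.

```python
import numpy as np


def project_to_simplex(v):
    """Euclidean projection of v onto {t : t >= 0, sum(t) = 1}."""
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u) - 1.0                  # shifted cumulative sums
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)              # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)


def alternating_regression(X, Z, n_iters=50):
    """Alternately minimize 0.5 * ||X w - Z t||^2 over w and over t.

    X: (N, D) features from one candidate model (assumed input).
    Z: (N, K) label embeddings, one column per foundation model (assumed input).
    Returns the regression weights w, the mixing weights t, and the
    mean squared error recorded after every iteration.
    """
    k = Z.shape[1]
    t = np.full(k, 1.0 / k)                   # start from uniform mixing weights
    # Step size 1/L, where L is the largest eigenvalue of Z^T Z.
    step = 1.0 / (np.linalg.norm(Z, 2) ** 2 + 1e-12)
    history = []
    for _ in range(n_iters):
        # w-step: exact least-squares fit to the current mixed target Z @ t.
        w = np.linalg.lstsq(X, Z @ t, rcond=None)[0]
        # t-step: one projected gradient step, which keeps t on the simplex.
        residual = Z @ t - X @ w
        t = project_to_simplex(t - step * (Z.T @ residual))
        history.append(float(np.mean((X @ w - Z @ t) ** 2)))
    return w, t, history
```

Under these assumptions, neither step can increase the objective: the w-step is an exact minimizer and the t-step uses a step size of 1/L on a convex quadratic, so the recorded error is non-increasing. One way to use the sketch is to call `alternating_regression(X, Z)` once per candidate model and rank the models by the final entry of `history`, with a lower residual read as higher estimated transferability; that ranking rule is also an assumption here.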

