Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.
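The core idea, training only a lightweight mapping between two frozen unimodal representation spaces on aligned image-text pairs, can be illustrated with a toy sketch. This is not the released MAPL code. The "encoders" here are hypothetical fixed linear maps over a shared latent vector, the mapper is a single linear layer, and the squared-error objective on caption embeddings is an assumption chosen to keep the example self-contained. All names (`encode_image`, `embed_caption`, `mapper`) are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the toy run is reproducible

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

# Hypothetical stand-ins for the frozen unimodal models: fixed linear maps
# from a shared latent "scene" vector. They are never updated below.
LATENT_DIM, VIS_DIM, LM_DIM = 3, 4, 3
VISION_WEIGHTS = rand_matrix(VIS_DIM, LATENT_DIM)  # frozen
TEXT_WEIGHTS = rand_matrix(LM_DIM, LATENT_DIM)     # frozen

def encode_image(scene):
    return matvec(VISION_WEIGHTS, scene)

def embed_caption(scene):
    return matvec(TEXT_WEIGHTS, scene)

# Aligned image-text data: each pair is derived from the same scene.
pairs = []
for _ in range(64):
    scene = [rng.uniform(-1, 1) for _ in range(LATENT_DIM)]
    pairs.append((encode_image(scene), embed_caption(scene)))

# The only trainable parameters: a linear mapper from vision space to LM space.
mapper = [[0.0] * VIS_DIM for _ in range(LM_DIM)]

def mean_squared_error(weights, data):
    total = 0.0
    for visual, target in data:
        predicted = matvec(weights, visual)
        total += sum((p - t) ** 2 for p, t in zip(predicted, target))
    return total / len(data)

def train(weights, data, lr=0.05, epochs=200):
    """Per-sample gradient descent on 0.5 * squared error; updates only the mapper."""
    for _ in range(epochs):
        for visual, target in data:
            predicted = matvec(weights, visual)
            for i in range(LM_DIM):
                error = predicted[i] - target[i]
                for j in range(VIS_DIM):
                    weights[i][j] -= lr * error * visual[j]
    return weights

initial_loss = mean_squared_error(mapper, pairs)
train(mapper, pairs)
final_loss = mean_squared_error(mapper, pairs)
trainable_params = sum(len(row) for row in mapper)
print(f"trainable parameters: {trainable_params}")
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

Only the 12 entries of `mapper` change during training, and both stand-in encoders stay fixed. This mirrors, at toy scale, why keeping the unimodal models frozen leaves few parameters to train.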