Recent vision-language models have shown impressive multi-modal generation capabilities. However, typically they require training huge models on massive datasets. As a more scalable alternative, we introduce Prismer, a data- and parameter-efficient vision-language model that leverages an ensemble of task-specific experts. Prismer only requires training of a small number of components, with the majority of network weights inherited from multiple readily-available, pre-trained experts, and kept frozen during training. By leveraging experts from a wide range of domains, we show Prismer can efficiently pool this expert knowledge and adapt it to various vision-language reasoning tasks. In our experiments, we show that Prismer achieves fine-tuned and few-shot learning performance which is competitive with current state-of-the-arts, whilst requiring up to two orders of magnitude less training data. Code is available at https://github.com/NVlabs/prismer.
翻译:近期视觉语言模型展现了令人印象深刻的多模态生成能力。然而,这些模型通常需要在海量数据集上训练庞大的模型。作为一种更具扩展性的替代方案,我们提出了Prismer,一种数据与参数高效的视觉语言模型,它利用一组任务特定专家模型进行集成学习。Prismer仅需训练少量组件,其大部分网络权重继承自多个现成的预训练专家模型,并在训练过程中保持冻结。通过利用来自广泛领域的专家模型,我们证明Prismer能够高效汇集这些专家知识,并将其适配到各类视觉语言推理任务中。实验表明,Prismer在微调和少样本学习场景下取得了与当前最先进模型相竞争的性能表现,同时所需训练数据量最多可减少两个数量级。代码已开源至https://github.com/NVlabs/prismer。