We propose a self-supervised shared-encoder model that achieves strong results on several vision, language, and multimodal benchmarks while being data-, memory-, and run-time-efficient. We make three key contributions. First, in contrast to most existing work, we use a single transformer in which every encoder layer processes both the text and image modalities. Second, we propose a stage-wise training strategy in which the model is first trained on images, then jointly on unimodal text and image datasets, and finally jointly on text and text-image datasets. Third, to preserve information across both modalities, we propose a training pipeline that learns simultaneously from the gradients of different modalities at each training update step. Results on downstream text-only, image-only, and multimodal tasks show that our model is competitive with several strong models while using fewer parameters and less pre-training data. For example, our model, MoMo, performs competitively with FLAVA on multimodal (+3.1), image-only (+1.1), and text-only (-0.1) tasks despite having two-fifths the number of parameters and using one-third of the image-text training pairs. Finally, we ablate various design choices and show that increasing model size produces significant performance gains, indicating potential for substantial improvements with larger models using our approach.
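The second and third contributions can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how per-modality gradients are combined, so summing them before a single update is an assumption here, and the quadratic losses, the `STAGES` and `TARGETS` tables, and the function names are all hypothetical stand-ins for a real shared encoder and its objectives.

```python
# Toy sketch of stage-wise training with per-step gradient accumulation
# across modalities. Everything here is illustrative: the real model is a
# transformer, and the way gradients are combined (summed) is an assumption.

# Stage order follows the abstract: images, then unimodal image + text,
# then text + image-text pairs.
STAGES = [
    ("stage1", ["image"]),
    ("stage2", ["image", "text"]),
    ("stage3", ["text", "image_text"]),
]

# Hypothetical per-modality optima; each toy loss is 0.5 * ||w - target||^2.
TARGETS = {
    "image": [1.0, 0.0],
    "text": [0.0, 1.0],
    "image_text": [1.0, 1.0],
}


def toy_grad(w, modality):
    """Gradient of the toy quadratic loss for one modality."""
    target = TARGETS[modality]
    return [wi - ti for wi, ti in zip(w, target)]


def train_step(w, modalities, lr=0.1):
    """Accumulate gradients from every modality, then update once."""
    accumulated = [0.0] * len(w)
    for modality in modalities:
        grad = toy_grad(w, modality)
        accumulated = [a + g for a, g in zip(accumulated, grad)]
    # A single update on the shared parameters uses the summed gradient.
    return [wi - lr * a for wi, a in zip(w, accumulated)]


def train(w, steps_per_stage=200):
    """Run the three stages in order on the same shared parameters."""
    for _name, modalities in STAGES:
        for _ in range(steps_per_stage):
            w = train_step(w, modalities)
    return w
```

Because each update sees every modality active in the current stage, no single modality's gradient overwrites the shared parameters on its own. In this toy setting the final stage settles midway between the text and image-text optima.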