COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

In the evolution of vision-language pre-training, the shift from short-text comprehension to extended textual contexts is pivotal. Recent autoregressive vision-language models such as \cite{flamingo, palme}, which leverage the long-context capability of Large Language Models, excel at few-shot text generation but face challenges in alignment tasks. To address this gap, we introduce a contrastive loss into text generation models and present the COntrastive-Streamlined MultimOdal framework (\ModelName), which partitions the language model into a dedicated unimodal text-processing component and a multimodal data-handling component. As a unified framework, \ModelName{} merges unimodal and multimodal elements, improving performance on tasks involving textual and visual data while notably reducing the number of learnable parameters. However, such models demand extensive long-text datasets, and high-quality long-text video datasets remain scarce. To bridge this gap, we introduce \VideoDatasetName, an inaugural interleaved video-text dataset featuring comprehensive captions. We further show that \VideoDatasetName{} improves model performance on image-text tasks. With 34\% of the learnable parameters and 72\% of the available data, our model significantly outperforms OpenFlamingo~\cite{openflamingo}. For instance, in the 4-shot Flickr captioning task, performance improves from 57.2\% to 65.\%. The contributions of \ModelName{} and \VideoDatasetName{} are underscored by notable performance gains across 14 diverse downstream datasets encompassing both image-text and video-text tasks.
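
To make the idea of adding a contrastive loss to a text generation objective concrete, the following is a minimal sketch rather than the actual training code of the framework. It assumes a symmetric InfoNCE-style contrastive term over paired image and text embeddings, added to a standard next-token cross-entropy term. The function names, the temperature of 0.07, and the weighting factor `contrastive_weight` are all hypothetical and chosen only for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss; pair k of images and texts is the positive.

    The temperature value is an assumption, not taken from the source.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    # Cosine-similarity logits: rows index images, columns index texts.
    logits = [
        [sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
        for i in imgs
    ]
    n = len(logits)
    # Image-to-text direction: each row should select its own column.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    cols = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(cols[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


def language_modeling_loss(token_logits, target_ids):
    """Mean next-token cross-entropy over a sequence of vocabulary logits."""
    total = sum(log_softmax(l)[t] for l, t in zip(token_logits, target_ids))
    return -total / len(target_ids)


def combined_loss(image_embs, text_embs, token_logits, target_ids,
                  contrastive_weight=1.0):
    """Generation loss plus a weighted contrastive term (weight is hypothetical)."""
    lm = language_modeling_loss(token_logits, target_ids)
    con = contrastive_loss(image_embs, text_embs)
    return lm + contrastive_weight * con
```

In this sketch, correctly paired embeddings drive the contrastive term toward zero, while mismatched pairs are penalized, which is the alignment signal that a purely generative objective lacks.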
