In this paper, we propose ``SimVLG'', a streamlined framework for pre-training computationally intensive vision-language generative models that leverages frozen pre-trained large language models (LLMs). The prevailing paradigm in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive stage dedicated to general-purpose vision-language representation learning, which extracts and consolidates the relevant visual features, followed by a second stage that performs end-to-end alignment between the visual and linguistic modalities. Our one-stage, single-loss framework bypasses the computationally demanding first stage by gradually merging similar visual tokens during training. This gradual merging compacts the visual information while preserving its semantic richness, leading to fast convergence without sacrificing performance. Our experiments show that our approach speeds up the training of vision-language models by a factor of $5$ without noticeable impact on overall performance. Additionally, we show that our models can achieve performance comparable to current vision-language models with only $1/10$ of the data. Finally, we demonstrate how our image-text models can be easily adapted to video-language generative tasks through a novel soft attentive temporal token merging module.
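To make the idea of gradually merging similar visual tokens concrete, the following is a minimal sketch in plain Python. It is not the authors' algorithm: it assumes a simple greedy rule (repeatedly merge the most cosine-similar pair into a size-weighted mean), and the names `cosine`, `merge_most_similar`, `gradual_merge`, and the per-layer `schedule` are hypothetical, introduced only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def merge_most_similar(tokens, sizes):
    """Merge the single most similar pair of tokens into one token.

    sizes[k] counts how many original tokens token k already represents,
    so the merged token is a size-weighted mean of the pair.
    """
    best, bi, bj = -2.0, 0, 1
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, bi, bj = s, i, j
    wi, wj = sizes[bi], sizes[bj]
    merged = [(wi * x + wj * y) / (wi + wj)
              for x, y in zip(tokens[bi], tokens[bj])]
    keep = [k for k in range(len(tokens)) if k not in (bi, bj)]
    new_tokens = [tokens[k] for k in keep] + [merged]
    new_sizes = [sizes[k] for k in keep] + [wi + wj]
    return new_tokens, new_sizes


def gradual_merge(tokens, schedule):
    """Reduce the token count step by step.

    schedule[l] is the (assumed) number of merges applied at layer l,
    so the sequence shrinks gradually rather than all at once.
    """
    sizes = [1] * len(tokens)
    for merges_at_layer in schedule:
        for _ in range(merges_at_layer):
            if len(tokens) < 2:
                break
            tokens, sizes = merge_most_similar(tokens, sizes)
    return tokens, sizes


# Four toy 2-D "visual tokens": two near the x-axis, two near the y-axis.
tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
merged, sizes = gradual_merge(tokens, schedule=[1, 1])
```

In this toy run, one merge per layer over two layers halves the sequence from four tokens to two, each representing two of the original tokens. A practical implementation would operate on batched tensors and would likely use a cheaper matching scheme than the quadratic pair search shown here.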