SimVLG: Simple and Efficient Pretraining of Visual Language Generative Models

In this paper, we propose ``SimVLG'', a streamlined framework for the pre-training of computationally intensive vision-language generative models, leveraging frozen pre-trained large language models (LLMs). The prevailing paradigm in vision-language pre-training (VLP) typically involves a two-stage optimization process: an initial resource-intensive phase dedicated to general-purpose vision-language representation learning, aimed at extracting and consolidating pertinent visual features, followed by a subsequent phase focusing on end-to-end alignment between visual and linguistic modalities. Our one-stage, single-loss framework circumvents the aforementioned computationally demanding first stage of training by gradually merging similar visual tokens during training. This gradual merging process effectively compacts the visual information while preserving the richness of semantic content, leading to fast convergence without sacrificing performance. Our experiments show that our approach can speed up the training of vision-language models by a factor $\times 5$ without noticeable impact on the overall performance. Additionally, we show that our models can achieve comparable performance to current vision-language models with only $1/10$ of the data. Finally, we demonstrate how our image-text models can be easily adapted to video-language generative tasks through a novel soft attentive temporal token merging modules.

翻译：本文提出"SimVLG"，一个利用冻结预训练大语言模型（LLMs）对计算密集型视觉语言生成模型进行预训练的简化框架。当前视觉语言预训练（VLP）的主流范式通常涉及两阶段优化过程：初始阶段专注于通用视觉语言表示学习，旨在提取并整合相关视觉特征，但需消耗大量计算资源；后续阶段则聚焦于视觉与语言模态间的端到端对齐。我们提出的单阶段、单损失框架通过训练中逐步合并相似视觉令牌的方式，规避了前述计算密集型的第一阶段训练。这种渐进式合并过程在保持语义内容丰富性的同时有效压缩视觉信息，从而实现快速收敛且性能无损。实验表明，本方法可将视觉语言模型的训练速度提升5倍，且对整体性能无显著影响。此外，仅需1/10的数据量，我们的模型即可达到与当前视觉语言模型相当的性能。最后，我们展示了如何通过新颖的软注意力时序令牌合并模块，将图像-文本模型轻松迁移至视频语言生成任务。

相关内容

MoDELS

关注 0

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

UCM《机器学习导论笔记》，80页pdf CSE176 Introduction to Machine Learning

专知会员服务

32+阅读 · 2021年9月29日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日