In this paper, we present \textbf{Gen}erative \textbf{L}anguage-\textbf{I}mage \textbf{P}re-training (GenLIP), a minimalist generative pretraining framework for Vision Transformers (ViTs) designed for multimodal large language models (MLLMs). To better align vision encoders with the autoregressive nature of LLMs, GenLIP trains a ViT to predict language tokens directly from visual tokens using a standard language modeling objective, without contrastive batch construction or an additional text decoder. This design offers three key advantages: (1) \textbf{Simplicity}: a single transformer jointly models visual and textual tokens; (2) \textbf{Scalability}: it scales effectively with both data and model size; and (3) \textbf{Performance}: it achieves competitive or superior results across diverse multimodal benchmarks. Trained on 8B samples from Recap-DataComp-1B, GenLIP matches or surpasses strong baselines despite using substantially less pretraining data. After continued pretraining on multi-resolution images at native aspect ratios, GenLIP further improves on detail-sensitive tasks such as OCR and chart understanding, making it a strong foundation for vision encoders in MLLMs.
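
For concreteness, the language modeling objective described above can be sketched as follows. The notation is illustrative and assumed here rather than taken from the method itself: $v_{1:N}$ denotes the visual tokens of an image, $w_{1:T}$ the tokens of its paired caption, and $\theta$ the parameters of the single transformer. We further assume the loss is computed over text positions only.
\begin{equation}
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(w_t \mid v_{1:N},\, w_{<t}\right)
\end{equation}
Under this assumed form, the loss for each image-caption pair depends only on that pair, so no negative examples are needed and the objective requires no contrastive batch construction.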