We present a novel methodology for optimizing the use of frozen large language models (LLMs) in resource-intensive vision-language (VL) pre-training. The current paradigm uses visual features as prompts to guide language models, focusing on determining the visual features most relevant to the corresponding text. Our approach instead concentrates on the language component, specifically on identifying the optimal prompts to align with visual features. We introduce the Prompt-Transformer (P-Former), a model that predicts these ideal prompts and is trained exclusively on linguistic data, bypassing the need for image-text pairs. This strategy splits end-to-end VL training by adding a separate, language-only stage. Our experiments show that our framework significantly improves the performance of a strong image-to-text baseline (BLIP-2) and effectively narrows the performance gap between models trained with 4M and 129M image-text pairs. Importantly, our framework is modality-agnostic and flexible in architectural design, as validated by its successful application to a video learning task using varied base modules. The code is available at https://github.com/yiren-jian/BLIText