Large-scale pretraining and instruction tuning have proven successful for training general-purpose language models with broad competencies. However, extending these techniques to general-purpose vision-language models is challenging because of the distributional diversity of visual inputs. A recent line of work explores vision-language instruction tuning, taking inspiration from the Query Transformer (QFormer) approach proposed in the BLIP-2 models for bridging frozen image encoders and frozen language models. However, these approaches rely heavily on large-scale multimodal pretraining for representation learning before eventual finetuning, which incurs a large computational overhead, scales poorly, and limits accessibility. To that end, we propose a more efficient method for QFormer-based vision-language alignment and demonstrate that it improves the efficiency of vision-language pretraining compared to existing baselines.