Multimodal large language models (MLLMs) have made significant strides by integrating visual and textual modalities. A critical factor in training MLLMs is the quality of image-text pairs in multimodal pretraining datasets. However, the \textit{de facto} filter-based data-quality enhancement paradigm often discards a substantial portion of high-quality image data due to inadequate semantic alignment between images and texts, leading to inefficient data utilization and poor scalability. In this paper, we propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs. AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy that improves its evaluative capability by including deliberately selected low-quality samples during training. Unlike prior approaches that substantially alter text distributions, our method adjusts the text only minimally, preserving data volume while enhancing quality. Experimental results demonstrate that AITQE surpasses existing methods on various benchmarks, effectively leveraging raw data and scaling efficiently with increasing data volume. We hope our work will inspire future research. The code and model are available at: https://github.com/hanhuang22/AITQE.
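To make the rewrite-instead-of-discard idea concrete, the following is a minimal sketch of the enhancement logic described above. It is not the released implementation; the quality scorer, caption rewriter, and the `enhance_pair` helper are hypothetical stand-ins for the corresponding AITQE components, and the threshold value is an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImageTextPair:
    image_path: str
    text: str


def enhance_pair(
    pair: ImageTextPair,
    score_fn: Callable[[ImageTextPair], float],  # hypothetical quality model scoring image-text alignment
    rewrite_fn: Callable[[ImageTextPair], str],  # hypothetical rewriter producing a minimally adjusted caption
    threshold: float = 0.5,                      # assumed quality cutoff, not from the paper
) -> ImageTextPair:
    """Keep well-aligned pairs as-is; rewrite the text of poorly aligned
    pairs instead of discarding the image, so data volume is preserved."""
    if score_fn(pair) >= threshold:
        return pair  # alignment is adequate; keep the original text
    # Low alignment: rewrite the caption rather than dropping the sample.
    return ImageTextPair(pair.image_path, rewrite_fn(pair))


def make_negative_pairs(pairs: list[ImageTextPair]) -> list[ImageTextPair]:
    """Sketch of negative-sample construction: mismatch images and texts to
    create deliberately low-quality pairs for training the quality scorer."""
    shifted_texts = [p.text for p in pairs[1:]] + [pairs[0].text]
    return [ImageTextPair(p.image_path, t) for p, t in zip(pairs, shifted_texts)]
```

In contrast to a pure filter, which maps each pair to keep-or-drop, this assess-then-rewrite step always returns a pair, which is what lets the method retain images whose original captions are poorly aligned.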