We propose a novel framework for filtering image-text data using fine-tuned Multimodal Language Models (MLMs). By integrating recent advances in MLMs, our approach outperforms predominant filtering methods (e.g., CLIPScore). We design four distinct yet complementary metrics to holistically measure the quality of image-text data, and we establish a new pipeline for constructing high-quality instruction data to fine-tune MLMs as data filters. Compared with CLIPScore, our MLM filters produce more precise and comprehensive scores that directly improve the quality of the filtered data and boost the performance of pre-trained models. We achieve significant improvements over CLIPScore on popular foundation models (i.e., CLIP and BLIP2) and on various downstream tasks. Our MLM filter generalizes to different models and tasks and can be used as a drop-in replacement for CLIPScore. An additional ablation study verifies our design choices for the MLM filter.
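To illustrate what "drop-in replacement for CLIPScore" could mean in practice, the following is a minimal sketch of score-based filtering in which the scoring function is a pluggable argument. It is not the implementation used in this work: the names `Sample`, `Scorer`, `filter_top_fraction`, and `toy_scorer` are hypothetical, the keep-top-fraction rule is an assumed selection strategy, and the toy scorer (caption word count) merely stands in for a real quality score such as CLIPScore or an MLM filter score.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a sample is an (image_path, caption) pair, and a scorer
# maps a sample to a numeric quality score (higher means better quality).
Sample = Tuple[str, str]
Scorer = Callable[[Sample], float]


def filter_top_fraction(
    samples: Sequence[Sample], scorer: Scorer, keep_fraction: float
) -> List[Sample]:
    """Keep the highest-scoring fraction of samples, preserving input order."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not samples:
        return []
    scores = [scorer(s) for s in samples]
    n_keep = max(1, int(len(samples) * keep_fraction))
    # Rank indices by descending score; ties are broken by original position.
    ranked = sorted(range(len(samples)), key=lambda i: (-scores[i], i))
    kept = sorted(ranked[:n_keep])
    return [samples[i] for i in kept]


def toy_scorer(sample: Sample) -> float:
    """Illustrative stand-in only: scores a pair by its caption word count."""
    _, caption = sample
    return float(len(caption.split()))


pairs = [
    ("a.jpg", "a dog"),
    ("b.jpg", "a brown dog running on the beach"),
    ("c.jpg", "img"),
    ("d.jpg", "two cats on a sofa"),
]
filtered = filter_top_fraction(pairs, toy_scorer, keep_fraction=0.5)
```

Because the filter depends only on the `Scorer` signature, swapping one scoring method for another would leave the rest of the data pipeline unchanged under these assumptions.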