Large multimodal models (LMMs) demonstrate a remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g., improvements of 18.8 and 18.3 in CIDEr score on COCO and NoCaps, respectively), sample efficiency (requiring 11-16 times less computation than baselines), world knowledge depth, and scalability. These advantages in effectiveness, efficiency, and scalability position CapsFusion as a promising candidate for future scaling of LMM training.