Recent advances in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it remains unclear whether they can fully replace AltTexts: the role of synthetic captions, and their interaction with the original web-crawled AltTexts, in pre-training is still not well understood. Moreover, different multimodal foundation models may have distinct preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. Through case studies ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+), we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that retains both synthetic captions and AltTexts can outperform synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.