Image captioning is a central task in computer vision that has progressed substantially since the advent of vision-language pre-training techniques. In this paper, we highlight a frequently overlooked limitation of captioning models: they often fail to capture semantically significant elements. This drawback can be traced back to the text-image datasets; while their captions typically offer a general depiction of image content, they frequently omit salient details. To mitigate this limitation, we propose FuseCap, a novel method for enriching captions with additional visual information obtained from vision experts such as object detectors, attribute recognizers, and optical character recognition (OCR) models. Our approach fuses the outputs of these vision experts with the original caption using a large language model (LLM), yielding enriched captions that present a comprehensive image description. We validate the effectiveness of the proposed caption enrichment method through both quantitative and qualitative analysis. We then use our method to curate the training set of a BLIP-based captioning model, which surpasses current state-of-the-art approaches in generating accurate and detailed captions while using significantly fewer parameters and less training data. As additional contributions, we provide a dataset comprising 12M image and enriched-caption pairs and show that the proposed method substantially improves image-text retrieval.