State-of-The-Art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training. This dataset contains annotations provided by human annotators, who typically produce captions averaging around ten tokens. However, this constraint presents a challenge in effectively capturing complex scenes and conveying detailed information. Furthermore, captioning models tend to exhibit bias towards the ``average'' caption, which captures only the more general aspects. What would happen if we were able to automatically generate longer captions, thereby making them more detailed? Would these captions, evaluated by humans, be more or less representative of the image content compared to the original MS-COCO captions? In this paper, we present a novel approach to address previous challenges by showcasing how captions generated from different SoTA models can be effectively fused, resulting in richer captions. Our proposed method leverages existing models from the literature, eliminating the need for additional training. Instead, it utilizes an image-text based metric to rank the captions generated by SoTA models for a given image. Subsequently, the top two captions are fused using a Large Language Model (LLM). Experimental results demonstrate the effectiveness of our approach, as the captions generated by our model exhibit higher consistency with human judgment when evaluated on the MS-COCO test set. By combining the strengths of various SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich, informative nature of human-generated descriptions. This advance opens up new possibilities for generating captions that are more suitable for the training of both vision-language and captioning models.
翻译:当前最先进的(SoTA)图像描述生成模型通常依赖于微软COCO(MS-COCO)数据集进行训练。该数据集包含由人工标注员提供的注释,这些标注员生成的描述平均长度约为十个词。然而,这一限制在有效捕捉复杂场景并传达详细信息方面构成了挑战。此外,描述生成模型往往对仅涵盖通用方面的“平均”描述存在偏见。如果我们能够自动生成更长的描述,从而使其更加详细,结果会如何?经过人工评估,这些描述相较于原始MS-COCO描述,是否更能代表图像内容?在本文中,我们提出了一种新颖的方法来解决上述挑战,通过展示如何有效融合来自不同SoTA模型生成的描述,从而获得更丰富的描述。我们提出的方法利用现有文献中的模型,无需额外训练。相反,它基于一种图像-文本度量指标,对给定图像中由SoTA模型生成的描述进行排序。随后,利用大型语言模型(LLM)将排名前两位的描述进行融合。实验结果表明了本方法的有效性:在MS-COCO测试集上评估时,我们模型生成的描述与人类判断具有更高的一致性。通过融合多个SoTA模型的优势,我们的方法提升了图像描述的质量与吸引力,弥合了自动化系统与人类生成描述在丰富性及信息性方面的差距。这一进展为生成更适用于视觉-语言模型及描述生成模型训练的描述开辟了新的可能。