Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion

State-of-the-art (SoTA) image captioning models often rely on the Microsoft COCO (MS-COCO) dataset for training. This dataset contains annotations provided by human annotators, who typically produce captions averaging around ten tokens. Captions this short struggle to capture complex scenes and convey detailed information. Furthermore, captioning models tend to be biased towards the "average" caption, which captures only the more general aspects of an image. What would happen if we were able to automatically generate longer, more detailed captions? Would these captions, evaluated by humans, be more or less representative of the image content than the original MS-COCO captions? In this paper, we present a novel approach to these challenges by showing how captions generated by different SoTA models can be effectively fused, resulting in richer captions. Our proposed method leverages existing models from the literature, eliminating the need for additional training. Instead, it uses an image-text based metric to rank the captions generated by SoTA models for a given image. The top two captions are then fused using a Large Language Model (LLM). Experimental results demonstrate the effectiveness of our approach: the captions generated by our model exhibit higher consistency with human judgment when evaluated on the MS-COCO test set. By combining the strengths of various SoTA models, our method enhances the quality and appeal of image captions, bridging the gap between automated systems and the rich, informative nature of human-generated descriptions. This advance opens up new possibilities for generating captions that are more suitable for training both vision-language and captioning models.
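
The rank-then-fuse pipeline described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the prompt wording, and the toy scorer are all hypothetical: `score_fn` stands in for whatever image-text metric is used, and `llm_fn` stands in for a call to an LLM. The abstract fixes only the overall flow: score the candidate captions, keep the top two, and ask an LLM to merge them.

```python
from typing import Callable, List, Tuple


def rank_captions(
    captions: List[str], score_fn: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Sort candidate captions by descending image-text score."""
    scored = [(caption, score_fn(caption)) for caption in captions]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def build_fusion_prompt(first: str, second: str) -> str:
    """Build an instruction asking an LLM to merge two captions.

    The wording is an assumption made for illustration only.
    """
    return (
        "Merge the two image captions below into one detailed caption "
        "without adding facts that appear in neither.\n"
        f"Caption 1: {first}\n"
        f"Caption 2: {second}\n"
        "Fused caption:"
    )


def rank_and_fuse(
    captions: List[str],
    score_fn: Callable[[str], float],
    llm_fn: Callable[[str], str],
) -> str:
    """Rank the candidates, then fuse the two best ones with the LLM."""
    if len(captions) < 2:
        raise ValueError("need at least two candidate captions to fuse")
    ranked = rank_captions(captions, score_fn)
    best, runner_up = ranked[0][0], ranked[1][0]
    return llm_fn(build_fusion_prompt(best, runner_up))


# Toy stand-ins so the sketch runs without any model (both hypothetical).
IMAGE_TAGS = {"dog", "frisbee", "park", "grass"}


def toy_score(caption: str) -> float:
    """Fraction of known image tags mentioned in the caption."""
    return len(set(caption.lower().split()) & IMAGE_TAGS) / len(IMAGE_TAGS)


def echo_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns the prompt unchanged."""
    return prompt


if __name__ == "__main__":
    candidates = [
        "a dog in a park",
        "a dog catching a frisbee on grass",
        "an animal outside",
    ]
    print(rank_and_fuse(candidates, toy_score, echo_llm))
```

In practice `toy_score` would be replaced by a real image-text similarity score computed against the image, and `echo_llm` by a real LLM client. Passing both in as callables keeps the ranking and fusion logic independent of any particular model.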
