If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to generate a single "best" caption, that is, the one most similar to a reference caption. Unfortunately, doing so encourages captions that are "informationally impoverished": they focus on only a subset of the possible details while ignoring other potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" (IC3), designed to generate a single caption that captures high-level details from several annotator viewpoints. Human raters judge captions produced by IC3 to be at least as helpful as those from baseline state-of-the-art (SOTA) models more than two thirds of the time, and IC3 can improve the performance of SOTA automated recall systems by up to 84%, outperforming single human-generated reference captions and indicating significant improvements over SOTA approaches for visual description. Code is available at https://davidmchan.github.io/caption-by-committee/
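The abstract does not spell out how IC3 forms its consensus, so the following is only a minimal sketch of one plausible reading: draw several candidate captions for the same image, then ask a text summarizer to merge them into one caption. The function names (`build_consensus_prompt`, `caption_by_committee`), the prompt wording, the deduplication step, and the default of five candidates are all hypothetical choices made for illustration. The captioner and summarizer are passed in as plain callables, so no particular model or library API is assumed.

```python
from typing import Callable, List, Sequence


def build_consensus_prompt(candidates: Sequence[str]) -> str:
    """Format candidate captions into one summarization request.

    The wording is a hypothetical placeholder, not the prompt used by IC3.
    """
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(candidates, start=1))
    return (
        "The following captions describe the same image from different viewpoints.\n"
        f"{numbered}\n"
        "Write one caption that combines the details they contain."
    )


def caption_by_committee(
    image: object,
    sample_caption: Callable[[object], str],
    summarize: Callable[[str], str],
    num_candidates: int = 5,
) -> str:
    """Sample several captions, drop exact duplicates, and merge the rest.

    sample_caption: assumed to return one (possibly different) caption per call.
    summarize: assumed to map a text prompt to a single merged caption.
    """
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")

    candidates: List[str] = []
    for _ in range(num_candidates):
        caption = sample_caption(image).strip()
        # Keep each distinct, non-empty viewpoint once, in sampling order.
        if caption and caption not in candidates:
            candidates.append(caption)

    if not candidates:
        raise ValueError("the captioner produced no usable captions")

    return summarize(build_consensus_prompt(candidates))
```

In practice, `sample_caption` would wrap a stochastic captioning model and `summarize` would wrap a language model. Keeping both as injected callables makes the committee step easy to test with stubs and independent of any specific backend.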