If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details while ignoring potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as those from baseline SOTA models more than two-thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee