If you ask a human to describe an image, they might do so in a thousand different ways. Traditionally, image captioning models are trained to approximate the reference distribution of image captions; however, doing so encourages captions that are viewpoint-impoverished. Such captions often focus on only a subset of the possible details, while ignoring potentially useful information in the scene. In this work, we introduce a simple yet novel method, "Image Captioning by Committee Consensus" ($IC^3$), designed to generate a single caption that captures high-level details from several viewpoints. Notably, humans rate captions produced by $IC^3$ at least as helpful as those from baseline SOTA models more than two thirds of the time, and $IC^3$ captions can improve the performance of SOTA automated recall systems by up to 84%, indicating significant material improvements over existing SOTA approaches for visual description. Our code is publicly available at https://github.com/DavidMChan/caption-by-committee