Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, which leads to problems such as vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify that image among a set of candidates. We experiment with the popular ClipCap captioner and replicate the main results with BLIP. In terms of similarity to ground-truth human descriptions, the captions produced after discriminative fine-tuning lag slightly behind those generated by the non-fine-tuned model when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to caption out-of-domain datasets, our discriminatively fine-tuned captioner generates descriptions that resemble human references more closely than those produced by the same captioner without fine-tuning. We further show that, on the Conceptual Captions dataset, discriminatively fine-tuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions to human annotators performing an image discrimination task.
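The discriminative objective described in the abstract can be illustrated with a minimal sketch. It assumes that the caption and the candidate images have already been embedded into a shared vector space by some text-conditioned retriever, and that the training signal is the log-probability the retriever assigns to the target image. The function names, the `temperature` parameter, and the use of cosine similarity with a softmax are illustrative assumptions, not details taken from the abstract.

```python
import numpy as np


def retrieval_log_probs(caption_emb, image_embs, temperature=1.0):
    """Log-probabilities a hypothetical retriever assigns to each candidate.

    caption_emb: shape (d,), embedding of the generated caption.
    image_embs:  shape (n, d), embeddings of the n candidate images.
    Assumption: the retriever scores candidates by cosine similarity and
    normalizes the scores with a softmax.
    """
    caption_emb = caption_emb / np.linalg.norm(caption_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    logits = image_embs @ caption_emb / temperature
    logits = logits - logits.max()  # shift for numerical stability
    return logits - np.log(np.exp(logits).sum())


def discriminative_reward(caption_emb, image_embs, target_index, temperature=1.0):
    """Reward for a caption: log-probability of retrieving the target image."""
    log_probs = retrieval_log_probs(caption_emb, image_embs, temperature)
    return float(log_probs[target_index])
```

Under these assumptions, a caption that is vague enough to match several candidates equally well spreads probability mass across them and earns a lower reward than a caption that singles out the target. Because caption generation involves discrete sampling, such a reward would typically be fed to a policy-gradient update rather than backpropagated directly; that training loop is outside the scope of this sketch.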