Transform, Contrast and Tell: Coherent Entity-Aware Multi-Image Captioning

Coherent entity-aware multi-image captioning aims to generate coherent captions for neighboring images in a news document. Neighboring images are linked by coherence relationships because they often describe the same entities or events. These relationships are important for entity-aware multi-image captioning but are neglected in entity-aware single-image captioning. Most existing work focuses on single-image captioning, while multi-image captioning has not been explored before. Hence, this paper proposes a coherent entity-aware multi-image captioning model that makes use of coherence relationships. The model consists of a Transformer-based caption generation model and two types of contrastive learning-based coherence mechanisms. The generation model generates a caption by attending to the image and the accompanying text. The caption-caption coherence mechanism encourages entities in an image's caption to also appear in the captions of neighboring images. The caption-image-text coherence mechanism encourages entities in an image's caption to also appear in the accompanying text. To evaluate coherence between captions, two coherence evaluation metrics are proposed. A new dataset, DM800K, is constructed; it has more images per document than the two existing datasets, GoodNews and NYT800K, and is more suitable for multi-image captioning. Experiments on three datasets show that the proposed captioning model outperforms 7 baselines according to BLEU, ROUGE, METEOR, and entity precision and recall scores. Experiments also show that the generated captions are more coherent than those of the baselines according to caption entity scores, caption ROUGE scores, the two proposed coherence evaluation metrics, and human evaluations.

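The abstract names contrastive learning as the basis of both coherence mechanisms but does not give the loss. As a rough illustration only, the sketch below uses an InfoNCE-style objective, which is one common contrastive formulation and is an assumption here, not the paper's stated method. The function names, the use of plain embedding vectors for captions, and the `temperature` parameter are all hypothetical. The idea shown is that a caption embedding is pulled toward the embedding of a neighboring image's caption (the positive) and pushed away from captions of unrelated images (the negatives).

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    anchor:    embedding of the current image's caption (hypothetical input)
    positive:  embedding of a neighboring image's caption
    negatives: embeddings of captions from unrelated images
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over the positive and all negatives.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )
    # Negative log-probability of picking the positive among all candidates.
    return log_denominator - pos_logit


# Toy vectors chosen for illustration; they are not data from the paper.
anchor = [1.0, 0.0]
coherent_neighbor = [0.9, 0.1]
unrelated_neighbor = [0.1, 0.9]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_coherent = info_nce(anchor, coherent_neighbor, negatives)
loss_unrelated = info_nce(anchor, unrelated_neighbor, negatives)
```

Under these assumptions, `loss_coherent` is lower than `loss_unrelated`, so minimizing the loss favors captions whose representations agree with those of neighboring images. The caption-image-text mechanism could be sketched the same way by swapping the positive for an embedding of the accompanying text, though the abstract does not confirm that construction.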