A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. These encodings include visual encodings (e.g., image grid features or detected objects) and, more recently, textual encodings (e.g., image tags or text descriptions of image regions). As more advanced encodings become available and are incorporated, a natural question arises: how can this heterogeneous set of encodings be leveraged efficiently and effectively? In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model efficiently encodes each view independently with a shared encoder, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation, first aggregating within each view at the token level and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k over the state of the art, and conduct rigorous analyses to demonstrate the importance of each part of our design.
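To make the two-level aggregation and the cross-view contrastive objective concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the model from this paper: the function names (`attend`, `hierarchical_attend`, `info_nce`) are hypothetical, the attention is ordinary scaled dot-product attention with a single query vector standing in for the decoder state, and the loss is a generic InfoNCE objective rather than the specific contrastive formulation proposed here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def attend(query, vectors):
    """Scaled dot-product attention of one query over a list of vectors."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, v) / scale for v in vectors])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors))
              for i in range(len(query))]
    return pooled, weights


def hierarchical_attend(query, views):
    """Aggregate within each view (token level), then across views (view level).

    `views` is a list of views; each view is a list of token vectors.
    Returns the fused context vector and the weight given to each view.
    """
    # Token level: summarize each encoded view separately.
    summaries = [attend(query, tokens)[0] for tokens in views]
    # View level: weigh the per-view summaries against each other.
    return attend(query, summaries)


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumption, not the paper's exact loss).

    anchors[i] and positives[i] are pooled embeddings of two views of the
    same image; every other positives[j] serves as a negative.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        losses.append(-math.log(softmax(logits)[i]))
    return sum(losses) / len(losses)


# Toy usage: one visual view with two tokens, one textual view with one token.
views = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0]],
]
context, view_weights = hierarchical_attend([1.0, 0.0], views)
```

In this sketch `view_weights` plays the role of the adaptive per-view weighting: a view whose summary matches the current query more closely receives a larger share of the fused context.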