Learning Distinct and Representative Styles for Image Captioning

Over the years, state-of-the-art (SoTA) image captioning methods have achieved promising results on some evaluation metrics (e.g., CIDEr). However, recent findings show that the captions generated by these methods tend to be biased toward the "average" caption, which captures only the most general mode (a.k.a. language pattern) in the training corpus. This is the so-called mode collapse problem. As a result, the generated captions are limited in diversity and are usually less informative than natural image descriptions written by humans. In this paper, we seek to avoid this problem by proposing a Discrete Mode Learning (DML) paradigm for image captioning. Our core idea is to explore the rich modes in the training caption corpus to learn a set of "mode embeddings", and then use them to control the mode of the captions generated by existing image captioning models. Specifically, the proposed DML optimizes a dual architecture that consists of an image-conditioned discrete variational autoencoder (CdVAE) branch and a mode-conditioned image captioning (MIC) branch. The CdVAE branch maps each image caption to one of the mode embeddings stored in a learned codebook, and is trained with a pure non-autoregressive generation objective to make the modes distinct and representative. The MIC branch can be obtained by a simple modification of an existing image captioning model, in which the mode embedding is added to the original word embeddings as the control signal. In the experiments, we apply the proposed DML to two widely used image captioning models, Transformer and AoANet. The results show that the learned mode embeddings enable these models to generate high-quality image captions with different modes, leading to better performance in both diversity and quality on the MSCOCO dataset.
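
To make the two branches more concrete, the following is a minimal sketch of the two operations the abstract names: mapping a caption to one entry of a codebook, and adding the selected mode embedding to the word embeddings as a control signal. The abstract does not say how the CdVAE branch chooses a codebook entry, so the nearest-neighbor lookup used here is an assumption borrowed from common vector-quantization practice, not the paper's stated method. The function names `nearest_mode` and `add_mode`, the toy vectors, and the 2-dimensional embedding size are all hypothetical.

```python
from typing import List, Sequence


def nearest_mode(caption_vec: Sequence[float],
                 codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to caption_vec.

    Assumption: closeness is squared Euclidean distance. The abstract only
    says each caption is mapped to one mode embedding in the codebook.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)),
               key=lambda k: sq_dist(caption_vec, codebook[k]))


def add_mode(word_embeddings: Sequence[Sequence[float]],
             mode_embedding: Sequence[float]) -> List[List[float]]:
    """Add the same mode embedding to every word embedding (the control signal)."""
    return [[w + m for w, m in zip(word, mode_embedding)]
            for word in word_embeddings]


# Toy codebook with three modes (hypothetical values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# Hypothetical encoding of one training caption.
caption_vec = [0.9, 1.2]
k = nearest_mode(caption_vec, codebook)

# Hypothetical word embeddings for a two-token caption prefix.
words = [[0.5, 0.5], [0.25, -0.25]]
conditioned = add_mode(words, codebook[k])

print(k)            # index of the selected mode
print(conditioned)  # word embeddings shifted by the selected mode embedding
```

At inference time, under the same assumptions, one would skip `nearest_mode` and instead call `add_mode` once per codebook index to request captions in different modes for the same image.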
