Image captioning aims to describe visual content in natural language. As 'a picture is worth a thousand words', an image can have many correct descriptions. However, with maximum likelihood estimation (MLE) as the training objective, the captioning model is penalized whenever its prediction differs from the label. For instance, when the model predicts a word that expresses richer semantics than the label, it is penalized and optimized to prefer more concise expressions; we refer to this as conciseness optimization. Conversely, predictions that are more concise than the label lead to richness optimization. These conflicting optimization directions can eventually lead the model to generate generic descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
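The penalty described above can be made concrete with a toy token-level loss. The sketch below is illustrative only: the abstract does not specify how SMILE blocks conciseness optimization, so the restricted-softmax variant shown here (normalizing only over words that occur in the reference caption) is an assumption and not necessarily the authors' formulation. The vocabulary, logits, and the helper name `nll` are hypothetical.

```python
import math

def nll(logits, target, allowed=None):
    """Negative log-likelihood of `target` under a softmax over `allowed` indices.

    logits  : list of floats, one score per vocabulary entry
    target  : index of the label word
    allowed : indices included in the softmax; None means the full vocabulary
    """
    if allowed is None:
        allowed = range(len(logits))
    allowed = set(allowed)
    if target not in allowed:
        raise ValueError("target must be in the allowed set")
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits[i] for i in allowed)
    log_z = m + math.log(sum(math.exp(logits[i] - m) for i in allowed))
    return log_z - logits[target]

# Hypothetical toy vocabulary and model scores for one decoding step.
vocab = ["dog", "puppy", "golden-retriever", "a", "runs"]
logits = [1.0, 0.5, 3.0, 0.0, 0.0]  # the model prefers the richer "golden-retriever"
target = vocab.index("dog")         # the label word is the more concise "dog"

# Standard MLE: softmax over the full vocabulary, so probability mass on
# "golden-retriever" directly increases the loss on the label "dog".
full = nll(logits, target)

# Assumed restricted variant: softmax only over words in the reference caption,
# so mass on richer out-of-caption words no longer contributes to the penalty.
caption_words = {vocab.index(w) for w in ["a", "dog", "runs"]}
restricted = nll(logits, target, caption_words)

print(f"full-vocabulary NLL: {full:.3f}")
print(f"restricted NLL:      {restricted:.3f}")
```

In this toy setting the restricted loss is lower than the full-vocabulary loss, because the score assigned to the richer word is excluded from the normalization and therefore is not pushed down.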