Image captioning aims to describe visual content in natural language. Because "a picture is worth a thousand words", an image can have many correct descriptions. However, with maximum likelihood estimation as the training objective, the captioning model is penalized whenever its prediction differs from the label. For instance, when the model predicts a word that expresses richer semantics than the label, it is penalized and optimized to prefer more concise expressions; we refer to this as conciseness optimization. Conversely, predictions that are more concise than the label lead to richness optimization. These conflicting optimization directions can eventually cause the model to generate generic descriptions. In this work, we introduce Semipermeable MaxImum Likelihood Estimation (SMILE), which allows richness optimization while blocking conciseness optimization, thus encouraging the model to generate longer captions with more details. Extensive experiments on two mainstream image captioning datasets, MSCOCO and Flickr30K, demonstrate that SMILE significantly enhances the descriptiveness of generated captions. We further provide in-depth investigations to facilitate a better understanding of how SMILE works.
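The two optimization directions can be made concrete with a toy single-step example. The sketch below is illustrative only: the four-word vocabulary, the logits, and the helper names `softmax` and `nll_and_grad` are hypothetical, and the masking rule (normalizing the softmax only over words that occur in the label caption) is an assumption used here to show how conciseness optimization could be blocked. The abstract itself does not specify the mechanism.

```python
import numpy as np

def softmax(logits, mask=None):
    """Softmax over the entries where mask is True (all entries if mask is None)."""
    logits = np.asarray(logits, dtype=float)
    if mask is None:
        mask = np.ones_like(logits, dtype=bool)
    # Masked-out entries get -inf, so they receive exactly zero probability.
    shifted = np.where(mask, logits - logits[mask].max(), -np.inf)
    exp = np.exp(shifted)
    return exp / exp.sum()

def nll_and_grad(logits, label, mask=None):
    """Negative log-likelihood of `label` and its gradient w.r.t. the logits."""
    probs = softmax(logits, mask)
    one_hot = np.zeros_like(probs)
    one_hot[label] = 1.0
    # For softmax cross-entropy, d(loss)/d(logits) = probs - one_hot.
    return -np.log(probs[label]), probs - one_hot

# Hypothetical toy vocabulary; the label caption is assumed to be "a dog ... grass".
vocab = ["a", "fluffy", "dog", "grass"]
label = vocab.index("dog")
rich = vocab.index("fluffy")

# The model prefers the richer word "fluffy" over the label word "dog".
logits = np.array([0.0, 2.0, 1.0, 0.0])

# Standard MLE: softmax over the full vocabulary.
mle_loss, mle_grad = nll_and_grad(logits, label)

# Assumed blocking rule: normalize only over words present in the label caption.
in_caption = np.array([True, False, True, True])
masked_loss, masked_grad = nll_and_grad(logits, label, in_caption)

print("MLE gradient on 'fluffy':   ", mle_grad[rich])
print("Masked gradient on 'fluffy':", masked_grad[rich])
```

Under full-vocabulary MLE the gradient on the "fluffy" logit is positive, so a gradient-descent step lowers that logit and pushes the model toward the more concise label word. With the assumed mask, "fluffy" receives no gradient at all, while words inside the mask are still pushed toward the label as usual.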