This paper focuses on improving the captions produced by image caption generation systems. We propose an approach that selects, from a system's candidate outputs, the caption most closely related to the image rather than the one the model scores as most likely. Our method revises the beam search output of the language generation model using the visual context of the image. We employ a visual semantic measure at both the word level and the sentence level to match the appropriate caption to the related information in the image. Because it operates purely as post-processing, the proposed approach can be applied to any captioning system.
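The selection step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the toy word vectors, the function names (`word_level_score`, `sentence_level_score`, `rerank`), and the equal weighting of the two scores are all assumptions made for illustration. The visual context is assumed to be a list of labels obtained from the image (for example, from an object classifier), and the beam is assumed to be a list of (caption, log-probability) pairs.

```python
import math

# Toy 3-d word vectors, invented for illustration only (not from the paper).
TOY_VECTORS = {
    "dog": (1.0, 0.1, 0.0),
    "frisbee": (0.1, 1.0, 0.0),
    "catches": (0.2, 0.8, 0.1),
    "man": (0.0, 0.1, 1.0),
    "sits": (0.0, 0.0, 0.9),
    "bench": (0.1, 0.0, 0.8),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(words):
    """Look up vectors for known words; unknown words are skipped."""
    return [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def word_level_score(caption_vecs, context_vecs):
    """Average, over caption words, of the best match to any context word."""
    best = [max(cosine(c, v) for v in context_vecs) for c in caption_vecs]
    return sum(best) / len(best)


def sentence_level_score(caption_vecs, context_vecs):
    """Similarity between the mean caption vector and mean context vector."""
    return cosine(mean_vector(caption_vecs), mean_vector(context_vecs))


def visual_score(caption, visual_context):
    """Hypothetical combined score; the 0.5/0.5 weighting is an assumption."""
    cap = embed(caption.lower().split())
    ctx = embed(visual_context)
    if not cap or not ctx:
        return 0.0
    return 0.5 * word_level_score(cap, ctx) + 0.5 * sentence_level_score(cap, ctx)


def rerank(beam, visual_context):
    """Sort beam candidates by visual relatedness instead of log-probability."""
    return sorted(beam, key=lambda item: visual_score(item[0], visual_context),
                  reverse=True)


# Beam candidates as (caption, log-probability); values are made up.
beam = [
    ("a man sits on a bench", -1.2),    # most likely under the model
    ("a dog catches a frisbee", -1.9),  # less likely, but matches the image
]
visual_context = ["dog", "frisbee"]     # assumed labels extracted from the image

most_likely = max(beam, key=lambda item: item[1])[0]
best_caption = rerank(beam, visual_context)[0][0]
print(most_likely)
print(best_caption)
```

In this toy setting the most likely caption and the most visually related caption differ, which is the situation the post-processing step is meant to address. A real system would replace the toy vectors with a trained similarity model.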