News image captioning is a variant of image captioning that requires a model to generate a more informative caption from a news image and its associated news article. Multimodal large language models (MLLMs) have developed rapidly in recent years and are promising for this task. However, our experiments show that common MLLMs are not good at generating entities in the zero-shot setting, and their ability to handle entity information remains limited even after simple fine-tuning on a news image captioning dataset. To obtain a model that handles multimodal entity information more effectively, we design two multimodal entity-aware alignment tasks and an alignment framework to align the model and generate news image captions. Our method outperforms previous state-of-the-art models in CIDEr score on the GoodNews dataset (72.33 -> 86.29) and on the NYTimes800k dataset (70.83 -> 85.61).