Modern image captioning systems rely heavily on extracting knowledge from images to capture the concept of a static story. In this paper, we propose a textual visual context dataset for captioning, in which the publicly available COCO Captions dataset (Lin et al., 2014) is extended with information about the scene (such as the objects in the image). Because this information is in textual form, it can be used to bring NLP methods, such as text similarity or semantic relation methods, into captioning systems, either through end-to-end training or as a post-processing step.