This paper presents a Keyword-driven and N-gram Graph based approach for Image Captioning (KENGIC). Most current state-of-the-art image caption generators are trained end-to-end on large-scale paired image-caption datasets, which are laborious and expensive to collect. Such models are limited in their explainability and in their applicability across different domains. To address these limitations, a simple model based on n-gram graphs is proposed that requires no end-to-end training on paired image-caption data. Starting with a set of image keywords treated as nodes, the generator forms a directed graph by connecting these nodes through overlapping n-grams found in a given text corpus. The model then infers the caption by searching the constructed graph for the most probable n-gram sequence. To analyse the use and choice of keywords in the context of this approach, this study examined caption generation based on (a) keywords extracted from gold-standard captions and (b) automatically detected keywords. Both quantitative and qualitative analyses demonstrated the effectiveness of KENGIC. The performance achieved is very close to that of current state-of-the-art image caption generators trained in the unpaired setting. The analysis of this approach could also shed light on the generation process behind current top-performing caption generators trained in the paired setting and, in addition, provide insights into the limitations of the most widely used evaluation metrics in automatic image captioning.
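The graph-based generation step described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`count_ngrams`, `build_graph`, `generate_caption`), the whitespace tokenisation, the exhaustive bounded search, and the scoring rule (keyword coverage first, then average log relative frequency of the n-grams on the path) are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def count_ngrams(corpus, n=3):
    """Count n-grams (tuples of lowercase tokens) over a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def build_graph(counts):
    """Directed graph: edge a -> b when the last n-1 tokens of a
    equal the first n-1 tokens of b (the n-grams overlap)."""
    by_prefix = defaultdict(list)
    for gram in counts:
        by_prefix[gram[:-1]].append(gram)
    return {gram: by_prefix[gram[1:]] for gram in counts}


def generate_caption(keywords, corpus, n=3, max_len=8):
    """Hypothetical scoring: prefer paths covering more keywords,
    then paths with a higher average log relative frequency."""
    counts = count_ngrams(corpus, n)
    graph = build_graph(counts)
    total = sum(counts.values())
    keywords = set(keywords)

    best = None  # (keywords covered, average log-probability, tokens)
    # Seed the search with every n-gram that contains a keyword.
    stack = [([g], math.log(counts[g] / total))
             for g in counts if keywords & set(g)]
    while stack:
        path, logp = stack.pop()
        # Overlapping n-grams contribute one new token each.
        tokens = list(path[0]) + [g[-1] for g in path[1:]]
        candidate = (len(keywords & set(tokens)), logp / len(path), tokens)
        if best is None or candidate[:2] > best[:2]:
            best = candidate
        if len(tokens) < max_len:
            for nxt in graph[path[-1]]:
                if nxt not in path:  # avoid cycles
                    stack.append((path + [nxt],
                                  logp + math.log(counts[nxt] / total)))
    return " ".join(best[2]) if best else ""


if __name__ == "__main__":
    toy_corpus = [
        "a dog runs on the beach",
        "a dog plays with a ball",
        "a man walks on the beach",
    ]
    print(generate_caption(["dog", "beach"], toy_corpus))
```

On the toy corpus, the keywords `dog` and `beach` never share an n-gram, so any caption containing both must be assembled by chaining overlapping trigrams, which is the linking behaviour the abstract describes. The exhaustive search here is only practical for tiny corpora; a real system would need pruning or a beam search.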