Image captioning is an important task at the intersection of computer vision and natural language processing. We propose AIC-AB NET, a novel Attribute-Information-Combined Attention-Based Network that combines a spatial attention architecture and text attributes in an encoder-decoder. For caption generation, adaptive spatial attention determines which image region best represents the image and whether to attend to the visual features or the visual sentinel. Text attribute information is synchronously fed into the decoder to help image recognition and reduce uncertainty. We evaluate AIC-AB NET on the MS COCO dataset and a newly proposed Fashion dataset, which serves as a benchmark of single-object images. The results show that the proposed model outperforms the state-of-the-art baseline and ablated models on both MS COCO images and our single-object images. AIC-AB NET outperforms the baseline adaptive attention network by 0.017 (CIDEr score) on the MS COCO dataset and 0.095 (CIDEr score) on the Fashion dataset.
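To make the choice between visual features and the visual sentinel concrete, the following is a minimal sketch of a sentinel-gated attention step, not the authors' implementation. It assumes the attention scores for the image regions and for the sentinel are already computed (in a trained model they would come from learned layers), and it leaves out the text attribute input entirely. The names `softmax` and `adaptive_context` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_context(region_feats, region_scores, sentinel, sentinel_score):
    """Mix k region features and one sentinel vector with a single softmax.

    Assumption: scores are given as plain floats; how they are produced
    is outside the scope of this sketch.
    """
    # One softmax over the k region scores plus the sentinel score.
    weights = softmax(list(region_scores) + [sentinel_score])
    beta = weights[-1]  # share of attention assigned to the sentinel
    dim = len(sentinel)
    # Start from the sentinel's contribution, then add the weighted regions.
    context = [beta * sentinel[d] for d in range(dim)]
    for w, feat in zip(weights[:-1], region_feats):
        for d in range(dim):
            context[d] += w * feat[d]
    return context, beta, weights[:-1]


# Two image regions with 2-dimensional features (toy values).
regions = [[1.0, 0.0], [0.0, 1.0]]
sentinel = [0.5, 0.5]
context, beta, region_weights = adaptive_context(regions, [0.0, 0.0], sentinel, 0.0)
```

In this sketch `beta` acts as the gate: a value near 1 means the decoder relies mostly on the sentinel (its own language context), while a value near 0 means it relies mostly on the image regions. The region weights and `beta` always sum to 1 because they come from the same softmax.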