Bridging the semantic gap between images and text is a central difficulty in image captioning. Prior studies have treated semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although these studies obtained promising results on concept prediction, they normally ignore the relationships among concepts, which depend not only on the objects in the image but also on word dependencies in the text, and which therefore offer considerable potential for generating better descriptions. In this paper, we propose a structured concept predictor (SCP) that predicts concepts and their structures and integrates them into captioning. The predicted concepts enhance the contribution of visual signals, and their relations help distinguish cross-modal semantics for better description generation. In particular, we design weighted graph convolutional networks (W-GCN) to model concept relations driven by word dependencies and then learn differentiated contributions from these concepts for the subsequent decoding process. Our approach thus captures potential relations among concepts and learns different concepts discriminatively, which effectively facilitates image captioning with information inherited across modalities. Extensive experiments demonstrate the effectiveness of our approach as well as of each proposed module.
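To make the idea of a weighted graph convolution over concepts concrete, the following is a minimal sketch of one generic propagation step, ReLU(D^-1 (A + I) H W). It is not the W-GCN formulation of this work, which the abstract does not specify. The function name `weighted_gcn_layer`, the row normalisation, and the toy edge weights are all illustrative assumptions; here `A` is assumed to hold non-negative weights between concept nodes (for example, derived from word dependencies), `H` holds the concept features, and `W` is a projection matrix.

```python
def weighted_gcn_layer(H, A, W):
    """One generic weighted graph-convolution step: ReLU(D^-1 (A + I) H W).

    H: n x d list of concept feature vectors (hypothetical inputs)
    A: n x n list of non-negative edge weights between concepts (assumed)
    W: d x k projection matrix
    """
    n = len(A)

    def matmul(X, Y):
        # Plain nested-list matrix product.
        return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
                for row in X]

    # Add self-loops so each concept keeps part of its own feature.
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    # Row-normalise by the weighted degree (at least 1 because of self-loops).
    A_norm = [[a / sum(row) for a in row] for row in A_hat]
    # Aggregate neighbour features, project, then apply ReLU.
    Z = matmul(matmul(A_norm, H), W)
    return [[max(z, 0.0) for z in row] for row in Z]


# Toy example: three concepts with 2-dimensional features.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A = [[0.0, 1.0, 0.0],   # concepts 0 and 1 are linked with weight 1.0
     [1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0]]   # concept 2 has no neighbours
W = [[1.0, 0.0], [0.0, 1.0]]  # identity projection, for readability
out = weighted_gcn_layer(H, A, W)
```

In this sketch, linked concepts blend their features in proportion to the edge weights, while an isolated concept passes through unchanged. A learned model would replace the fixed `A` and `W` with trained parameters.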