Current region feature-based image captioning methods have progressed rapidly and achieved remarkable performance. However, they remain prone to generating irrelevant descriptions because they lack contextual information and over-rely on previously generated partial descriptions when predicting the remaining words. In this paper, we propose a Dual-Stream Collaborative Transformer (DSCT) that addresses this issue by introducing segmentation features. The proposed DSCT consolidates and then fuses region and segmentation features to guide caption generation. It consists of multiple Pattern-Specific Mutual Attention Encoders (PSMAEs) and Dynamic Nomination Decoders (DNDs). The PSMAE highlights and consolidates the private information of the two representations by having them query each other. The DND dynamically selects the learning blocks most relevant to the input textual representations and exploits the homogeneous features shared by the consolidated region and segmentation features to generate more accurate and descriptive captions. To the best of our knowledge, this is the first study to explore how to dynamically fuse different pattern-specific features for image captioning while bypassing their semantic inconsistency and spatial misalignment. Experimental results on popular benchmark datasets demonstrate that our DSCT outperforms state-of-the-art image captioning models in the literature.