Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

Cross-lingual image captioning is a challenging task that requires overcoming both cross-lingual and cross-modal obstacles in multimedia analysis. The central problem is to model both the global and the local matching between an image and different languages. Existing cross-modal embedding methods based on the transformer architecture overlook the local matching between image regions and monolingual words, especially when dealing with diverse languages. To overcome these limitations, we propose an Embedded Heterogeneous Attention Transformer (EHAT), which uses a heterogeneous network to establish cross-domain relationships and local correspondences between images and different languages. EHAT comprises Masked Heterogeneous Cross-attention (MHCA), a Heterogeneous Attention Reasoning Network (HARN), and Heterogeneous Co-attention (HCA). HARN serves as the core network: it captures cross-domain relationships by using visual bounding-box representation features to connect word features from two languages and to learn heterogeneous maps. MHCA and HCA facilitate cross-domain integration in the encoder through specialized heterogeneous attention mechanisms, enabling a single model to generate captions in two languages. We evaluate our approach on the MSCOCO dataset by generating captions in English and Chinese, two languages that belong to markedly different language families. The experimental results show that our method outperforms existing advanced monolingual methods. The proposed EHAT framework effectively addresses the challenges of cross-lingual image captioning and paves the way for improved multilingual image analysis and understanding.
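
The idea of using image regions as a shared anchor for words from two languages can be illustrated with a minimal sketch. This is not the EHAT implementation: it is plain scaled dot-product cross-attention written in NumPy, and the function names, feature sizes, random inputs, and the mask are all assumptions made for illustration. It only shows how word features from each language could attend to the same set of bounding-box features, with an optional mask that blocks selected word-region pairs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(words, regions, mask=None):
    """Word queries attend to region keys/values (hypothetical helper).

    words:   (n_words, d) word features for one language
    regions: (n_regions, d) bounding-box features shared by both languages
    mask:    optional (n_words, n_regions) boolean array, True = may attend
    Returns the attended context (n_words, d) and the attention weights.
    """
    d = words.shape[-1]
    scores = words @ regions.T / np.sqrt(d)
    if mask is not None:
        # Blocked pairs get a very negative score, so their weight becomes 0.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores, axis=-1)
    return weights @ regions, weights


# Random stand-ins for real features (assumed shapes, not from the paper).
rng = np.random.default_rng(0)
d = 8
regions = rng.normal(size=(5, d))   # 5 image regions
en_words = rng.normal(size=(4, d))  # 4 English tokens
zh_words = rng.normal(size=(6, d))  # 6 Chinese tokens

# Both languages attend to the same regions, so the regions act as the link.
en_ctx, en_w = cross_attention(en_words, regions)
zh_ctx, zh_w = cross_attention(zh_words, regions)

# Example mask: forbid every English token from attending to the last region.
mask = np.ones((4, 5), dtype=bool)
mask[:, -1] = False
en_ctx_masked, en_w_masked = cross_attention(en_words, regions, mask)
```

Each row of `en_w` and `zh_w` is a distribution over the same five regions, which is one simple way to read "local matching" between image regions and words in either language.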
