Image-text retrieval, which associates different modalities, has drawn broad attention owing to its research value and wide real-world applicability. Although algorithms continue to evolve, most do not fully consider the high-level semantic relationships ("style embedding") and common knowledge shared across modalities. To this end, we propose a novel style transformer network with common knowledge optimization (CKSTN) for image-text retrieval. Its main module is the common knowledge adaptor (CKA), which comprises the style embedding extractor (SEE) and the common knowledge optimization (CKO) modules. Specifically, the SEE is designed to extract high-level features effectively, while the CKO module dynamically captures latent concepts of common knowledge from different modalities. Together, they assist in forming item representations in lightweight transformers. In addition, to obtain generalized temporal common knowledge, we propose a sequential update strategy that integrates the features of different layers in the SEE with previous common feature units. CKSTN outperforms state-of-the-art methods in image-text retrieval on the MSCOCO and Flickr30K datasets. Moreover, CKSTN is more convenient and practical for real-world applications, owing to its better performance and fewer parameters.