Image-text matching aims to accurately retrieve matching pairs across the image and text modalities. Current methods typically project features from both modalities into a common embedding space, but they often suffer from imbalanced feature representations across modalities, which leads to unreliable retrieval results. To address this limitation, we introduce a novel Feature Enhancement Module that adaptively aggregates single-modal features for more balanced and robust image-text retrieval. We also propose a new loss function that overcomes the shortcomings of the original triplet ranking loss and thereby significantly improves retrieval performance. We evaluate the proposed model on two public datasets, where it achieves competitive retrieval performance compared with several state-of-the-art models. The implementation code is available here.
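For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of a commonly used bidirectional triplet ranking loss over a batch of paired embeddings. It is not the loss proposed in this work. The function names `cosine` and `triplet_ranking_loss`, the cosine similarity measure, the sum over all in-batch negatives, and the margin value of 0.2 are all assumptions made for illustration.

```python
def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)


def triplet_ranking_loss(image_embs, text_embs, margin=0.2):
    """Hinge-based ranking loss summed over all in-batch negatives.

    image_embs[i] and text_embs[i] are assumed to form a matched pair;
    every other combination in the batch is treated as a negative.
    The margin of 0.2 is an illustrative default, not a value from the paper.
    """
    n = len(image_embs)
    # sim[i][j] is the similarity between image i and text j.
    sim = [[cosine(image_embs[i], text_embs[j]) for j in range(n)]
           for i in range(n)]
    loss = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if j == i:
                continue
            # Image i should score higher with its own text than with text j.
            loss += max(0.0, margin - positive + sim[i][j])
            # Text i should score higher with its own image than with image j.
            loss += max(0.0, margin - positive + sim[j][i])
    return loss


if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
    swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
    print(triplet_ranking_loss(images, aligned_texts))
    print(triplet_ranking_loss(images, swapped_texts))
```

In this sketch, a batch whose matched pairs are already well separated from all negatives contributes zero loss, while a batch whose pairs are swapped is penalized in both retrieval directions.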