Exploiting fine-grained correspondence and visual-semantic alignment has shown great potential in image-text matching. Recent approaches generally first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies that rely on complex architectures or additional information, while ignoring the regulation ability of network feedback. In this paper, we develop two simple but effective regulators that encode the output message to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR), which progressively facilitates the cross-modal attention unit with adaptive attention factors to capture more flexible correspondence, and (ii) a Recurrent Aggregation Regulator (RAR), which repeatedly adjusts the aggregation weights to increasingly emphasize important alignments and dilute unimportant ones. Moreover, RCR and RAR are plug-and-play: both can be incorporated into many frameworks based on cross-modal interaction to obtain significant benefits, and their cooperation achieves further improvements. Extensive experiments on the MSCOCO and Flickr30K datasets show that they bring a consistent R@1 gain on multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR.
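To make the idea of repeatedly adjusting aggregation weights concrete, the following is a minimal illustrative sketch and not the RAR implementation from the paper or its repository. It assumes each region-word alignment is a plain vector, starts from uniform weights, and at each step re-scores every alignment by its dot product with the current pooled vector. The function names `softmax` and `recurrent_aggregate`, the `steps` and `temperature` parameters, and the dot-product scoring rule are all hypothetical choices made for illustration; the actual regulators are learned modules.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool(weights, alignments):
    """Weighted sum of alignment vectors, one value per dimension."""
    dim = len(alignments[0])
    return [sum(w * a[d] for w, a in zip(weights, alignments)) for d in range(dim)]


def recurrent_aggregate(alignments, steps=3, temperature=1.0):
    """Iteratively re-weight alignment vectors using feedback from the pooled output.

    Hypothetical scoring rule: dot product between the pooled vector and
    each alignment, so alignments that agree with the consensus gain weight.
    Returns the final pooled vector and the list of weights at every step.
    """
    n = len(alignments)
    weights = [1.0 / n] * n  # start from uniform aggregation
    history = [weights]
    for _ in range(steps):
        pooled = weighted_pool(weights, alignments)
        scores = [sum(p * x for p, x in zip(pooled, a)) / temperature for a in alignments]
        weights = softmax(scores)  # feedback step: update the weights
        history.append(weights)
    return weighted_pool(weights, alignments), history


if __name__ == "__main__":
    # Two mutually consistent alignments and one outlier (toy values).
    toy = [[1.0, 0.0], [0.9, 0.1], [-0.2, 0.1]]
    pooled, history = recurrent_aggregate(toy, steps=3)
    for step, w in enumerate(history):
        print(step, [round(x, 3) for x in w])
```

In this toy run the weight of the outlier alignment shrinks at every step while the two consistent alignments gain weight, which mirrors, in a much simplified form, the "emphasize important alignments and dilute unimportant ones" behaviour described above.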