The success of speech-image retrieval relies on establishing an effective alignment between speech and image. Existing methods often model cross-modal interaction through a simple cosine similarity between the global features of the two modalities, which falls short of capturing fine-grained details within each modality. To address this issue, we introduce an effective framework and a novel learning task named cross-modal denoising (CMD) that enhance cross-modal interaction to achieve finer-grained cross-modal alignment. Specifically, CMD is a denoising task designed to reconstruct semantic features from noisy features within one modality by interacting with features from the other modality. Notably, CMD operates only during model training and can be removed at inference, adding no extra inference time. Experimental results demonstrate that on the speech-image retrieval task, our framework outperforms the state-of-the-art method by 2.0% in mean R@1 on the Flickr8k dataset and by 1.7% in mean R@1 on the SpokenCOCO dataset. These results validate the efficiency and effectiveness of our framework.
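The CMD objective described above can be illustrated with a minimal sketch: corrupt one modality's features with noise, reconstruct them by attending to the other modality's features, and penalize the reconstruction error. This is an illustrative assumption of the mechanism, not the paper's exact architecture; the single-head scaled dot-product cross-attention, the additive Gaussian noise model, and all dimensions here are placeholders.

```python
# Minimal sketch of a cross-modal denoising (CMD) loss.
# Assumptions (not from the paper): single-head scaled dot-product
# cross-attention, additive Gaussian noise, toy feature dimensions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys_values):
    """Reconstruct query features by attending to the other modality."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)   # (Tq, Tk) similarities
    weights = softmax(scores, axis=-1)              # attention over the other modality
    return weights @ keys_values                    # (Tq, d) reconstruction

# Toy features: 4 speech frames and 3 image regions, dimension 8.
speech = rng.standard_normal((4, 8))
image = rng.standard_normal((3, 8))

# Corrupt the speech features, then denoise them via the image features;
# the CMD loss is the error against the clean speech features.
noisy_speech = speech + 0.1 * rng.standard_normal(speech.shape)
reconstructed = cross_attend(noisy_speech, image)
cmd_loss = float(np.mean((reconstructed - speech) ** 2))
```

During training this loss would be added to the usual retrieval (contrastive) objective; because the denoising branch is auxiliary, it can be dropped at inference, matching the abstract's claim of zero added inference cost.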