Cross-modal retrieval has become popular in recent years, particularly with the rise of multimedia. In general, each modality exhibits its own representations and semantic information, so features encoded with a dual-tower architecture tend to lie in separate latent spaces. This makes it difficult to establish semantic relationships between modalities and results in poor retrieval performance. To address this issue, we propose a novel framework for cross-modal retrieval that consists of a cross-modal mixer, a masked autoencoder for pre-training, and a cross-modal retriever for downstream tasks. Specifically, we first adopt the cross-modal mixer and mask modeling to fuse the original modalities and eliminate redundancy. Then, an encoder-decoder architecture is applied to perform a fuse-then-separate task in the pre-training phase: we feed masked fused representations into the encoder and reconstruct them with the decoder, ultimately separating the original data of the two modalities. In downstream tasks, we use the pre-trained encoder to build the cross-modal retrieval method. Extensive experiments on two real-world datasets show that our approach outperforms previous state-of-the-art methods in video-audio matching tasks, improving retrieval accuracy by up to 2 times. Furthermore, we demonstrate our model's performance by transferring it to other downstream tasks as a universal model.
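The fuse-then-separate pre-training step described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the mixer is taken to be plain concatenation along the token axis, the encoder and decoder are single linear layers rather than the actual networks, both modalities are assumed to share one feature dimension, and all function and parameter names (`mix`, `random_mask`, `pretrain_loss`, `w_enc`, `w_dec`, `mask_token`) are hypothetical.

```python
import numpy as np


def mix(video, audio):
    """Fuse two token sequences; concatenation is an assumed stand-in for the mixer."""
    return np.concatenate([video, audio], axis=0)


def random_mask(num_tokens, mask_ratio, rng):
    """Return the sorted indices of the tokens that stay visible."""
    num_keep = max(1, int(round(num_tokens * (1.0 - mask_ratio))))
    return np.sort(rng.permutation(num_tokens)[:num_keep])


def encode(visible_tokens, w_enc):
    """Toy encoder: one linear layer with tanh (hypothetical, not the paper's encoder)."""
    return np.tanh(visible_tokens @ w_enc)


def decode(latent, keep, num_tokens, mask_token, w_dec):
    """Toy decoder: put a shared mask token at hidden positions, then project back."""
    full = np.tile(mask_token, (num_tokens, 1))
    full[keep] = latent
    return full @ w_dec


def separate(reconstruction, num_video):
    """Split the reconstructed fused sequence back into the two modalities."""
    return reconstruction[:num_video], reconstruction[num_video:]


def init_params(feat_dim, latent_dim, rng):
    """Random toy parameters; in practice these would be learned."""
    return {
        "w_enc": rng.normal(scale=0.1, size=(feat_dim, latent_dim)),
        "w_dec": rng.normal(scale=0.1, size=(latent_dim, feat_dim)),
        "mask_token": np.zeros(latent_dim),
    }


def pretrain_loss(video, audio, params, mask_ratio, rng):
    """Mask the fused sequence, reconstruct it, and score each modality separately."""
    fused = mix(video, audio)
    keep = random_mask(len(fused), mask_ratio, rng)
    latent = encode(fused[keep], params["w_enc"])
    recon = decode(latent, keep, len(fused), params["mask_token"], params["w_dec"])
    video_hat, audio_hat = separate(recon, len(video))
    loss = np.mean((video_hat - video) ** 2) + np.mean((audio_hat - audio) ** 2)
    return float(loss), video_hat, audio_hat


rng = np.random.default_rng(0)
video = rng.normal(size=(4, 8))   # 4 video tokens, 8 features each (toy sizes)
audio = rng.normal(size=(6, 8))   # 6 audio tokens, same feature size (assumed)
params = init_params(feat_dim=8, latent_dim=16, rng=rng)
loss, video_hat, audio_hat = pretrain_loss(video, audio, params, 0.5, rng)
```

In this sketch only the visible tokens pass through the encoder, and the reconstruction loss is computed per modality after the split. For the downstream retrieval stage, the abstract says the pre-trained encoder is reused; the sketch does not cover that stage.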