Cross-modal retrieval (CMR) typically learns common representations to directly measure similarities between multimodal samples. Most existing CMR methods assume that multimodal samples come in pairs and employ joint training to learn common representations, which limits the flexibility of CMR. Although some methods adopt independent training strategies for each modality to improve flexibility, they rely on randomly initialized orthogonal matrices to guide representation learning. This is suboptimal because it assumes that inter-class samples are independent of each other, limiting the potential semantic alignment between sample representations and ground-truth labels. To address these issues, we propose a novel method termed Deep Reversible Consistency Learning (DRCL) for cross-modal retrieval. DRCL comprises two core modules, \ie, Selective Prior Learning (SPL) and Reversible Semantic Consistency Learning (RSC). More specifically, SPL first learns a transformation weight matrix on each modality and selects the best one as the prior according to a quality score, which largely avoids blindly adopting priors learned from low-quality modalities. RSC then employs a Modality-invariant Representation Recasting mechanism (MRR) to recast potential modality-invariant representations from sample semantic labels via the generalized inverse matrix of the prior. Since labels are devoid of modality-specific information, we utilize the recast features to guide representation learning, thereby maintaining semantic consistency to the fullest extent possible. In addition, a Feature Augmentation mechanism (FA) is introduced in RSC to encourage the model to learn over a wider data distribution for diversity. Finally, extensive experiments on five widely used datasets and comparisons with 15 state-of-the-art baselines demonstrate the effectiveness and superiority of our DRCL.
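The recasting step in MRR can be illustrated with a minimal NumPy sketch. This is a hypothetical toy setup, not the paper's implementation: it assumes the learned prior is a matrix `W` mapping `d`-dimensional features to a `c`-dimensional label space, so that modality-invariant representations can be recast from one-hot labels via the Moore-Penrose generalized inverse of `W`. All shapes and variable names here are illustrative assumptions.

```python
import numpy as np

# Toy dimensions (assumed): feature dim d, number of classes c, batch size n.
rng = np.random.default_rng(0)
d, c, n = 8, 3, 5

# W: a stand-in for the prior transformation selected by SPL (features -> labels).
W = rng.standard_normal((d, c))

# Y: one-hot ground-truth semantic labels for a batch of samples.
Y = np.eye(c)[rng.integers(0, c, n)]

# Recast modality-invariant representations from labels using the
# generalized (Moore-Penrose) inverse of the prior: X_recast = Y @ W^+.
W_pinv = np.linalg.pinv(W)   # shape (c, d)
X_recast = Y @ W_pinv        # shape (n, d)

# Sanity check: pushing the recast features back through the prior
# recovers the labels exactly when W has full column rank.
print(np.allclose(X_recast @ W, Y))
```

Because the recast features carry no modality-specific information (they are derived purely from labels), they can serve as semantically consistent targets for each modality's independently trained encoder.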