Cross-modal retrieval (CMR) has been widely applied in domains such as multimedia search engines and recommendation systems. Most existing CMR methods focus on image-to-text retrieval, whereas audio-to-text retrieval remains less explored and is challenging because discriminative features are difficult to uncover from audio clips and texts. Existing studies are limited in two ways: 1) Most use contrastive learning to construct a common subspace in which similarities among data can be measured. However, they consider only cross-modal transformation and neglect intra-modal separability. In addition, the temperature parameter is not adaptively adjusted under semantic guidance, which degrades performance. 2) These methods do not take latent representation reconstruction into account, which is essential for semantic alignment. This paper introduces a novel audio-text CMR approach, termed Contrastive Latent Space Reconstruction Learning (CLSR). CLSR improves contrastive representation learning by accounting for intra-modal separability and adopting an adaptive temperature control strategy. Moreover, latent representation reconstruction modules are embedded into the CMR framework to improve modal interaction. Experiments against state-of-the-art methods on two audio-text datasets validate the superiority of CLSR.
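The abstract gives no formulas, so the following is only a minimal sketch of the general idea of pairing a cross-modal contrastive objective with an intra-modal separability term. It is not the CLSR loss. The symmetric InfoNCE form, the squared-cosine penalty, and the names `combined_loss`, `intra_modal_penalty`, and `intra_weight` are all assumptions made for illustration. The temperature is a plain argument here, whereas CLSR is described as adjusting it adaptively, and the latent reconstruction modules are not shown.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(queries, keys, temperature):
    # Mean negative log-probability that query i selects its matched key i.
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, k) / temperature for k in keys]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(queries)


def intra_modal_penalty(embs):
    # Hypothetical separability term: mean squared cosine similarity
    # between different samples of the same modality (needs >= 2 samples).
    n = len(embs)
    total = sum(
        dot(embs[i], embs[j]) ** 2
        for i in range(n)
        for j in range(n)
        if i != j
    )
    return total / (n * (n - 1))


def combined_loss(audio, text, temperature=0.07, intra_weight=0.1):
    # audio[i] and text[i] are assumed to be a matched pair.
    audio = [normalize(a) for a in audio]
    text = [normalize(t) for t in text]
    cross = 0.5 * (
        info_nce(audio, text, temperature)
        + info_nce(text, audio, temperature)
    )
    intra = intra_modal_penalty(audio) + intra_modal_penalty(text)
    return cross + intra_weight * intra


# Toy usage: matched pairs should score a lower loss than shuffled pairs.
audio = [[1.0, 0.0], [0.0, 1.0]]
text = [[2.0, 0.0], [0.0, 3.0]]
aligned = combined_loss(audio, text)
shuffled = combined_loss(audio, list(reversed(text)))
```

In this sketch the cross-modal term pulls matched audio-text pairs together, while the intra-modal term discourages different samples within one modality from collapsing onto each other. How CLSR actually weights or formulates these terms is not stated in the abstract.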