COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyzes the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that offers a broader view of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component accounts for only part of the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.
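
To make the idea of PLS-SVD with spectral truncation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that PLS-SVD here means taking the SVD of the cross-covariance between centered, paired audio and text embeddings, and that truncation means keeping only the top-k paired axes. The function names `pls_svd_axes` and `truncate`, the per-modality mean removal, the final L2 normalization, and the synthetic data standing in for CLAP embeddings are all illustrative assumptions.

```python
import numpy as np


def pls_svd_axes(audio, text, k):
    """Top-k paired axes from the SVD of the audio-text cross-covariance.

    audio, text: arrays of shape (n_pairs, dim), row i of each is one pair.
    Returns the two modality means, the audio axes, the singular values,
    and the text axes. (Hypothetical helper, assumed for illustration.)
    """
    mu_a = audio.mean(axis=0)
    mu_t = text.mean(axis=0)
    # Cross-covariance of the centered embeddings, shape (dim_a, dim_t).
    cross_cov = (audio - mu_a).T @ (text - mu_t) / (audio.shape[0] - 1)
    u, s, vt = np.linalg.svd(cross_cov, full_matrices=False)
    return mu_a, mu_t, u[:, :k], s[:k], vt[:k].T


def truncate(x, mu, axes):
    """Center x, project onto the kept axes, and L2-normalize each row."""
    z = (x - mu) @ axes
    return z / np.linalg.norm(z, axis=1, keepdims=True)


# Synthetic stand-in for paired embeddings: k shared latent factors mixed
# differently into each modality, plus noise and a modality-specific offset.
rng = np.random.default_rng(0)
n, d, k = 500, 32, 4
shared = rng.normal(size=(n, k))
audio = shared @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d)) + 2.0
text = shared @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d)) - 2.0

mu_a, mu_t, U, s, V = pls_svd_axes(audio, text, k)
za = truncate(audio, mu_a, U)  # audio in the k-dimensional shared space
zt = truncate(text, mu_t, V)   # text in the same k-dimensional space

print("kept singular values:", s)
print("truncated shapes:", za.shape, zt.shape)
```

In this sketch the kept singular values rank the paired axes by how much audio-text covariance each one carries, so the choice of `k` sets the trade-off between dimensionality and retained shared structure. How the axes are actually selected and how the mean is handled in COMET is not specified in the abstract.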
