CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models

A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained with a contrastive approach (e.g., CLAP), which learns a shared representation between the audio and language modalities, have improved performance in many downstream applications, including zero-shot audio classification and audio retrieval. However, the ability of these models to perform compositional reasoning remains largely unexplored and requires further research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks with a majority of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audio samples contain the same acoustic events but in different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using these benchmarks, we first show that current ALMs perform only marginally better than random chance, indicating that they struggle with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audio. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
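To make the evaluation setup concrete, the following is a minimal sketch of how one benchmark instance (two audio-caption pairs) might be scored. The abstract does not specify the exact metric, so the text, audio, and group scores below are an assumption modeled on common paired-matching evaluations. The function names `cosine` and `score_instance` are hypothetical, and the embeddings are assumed to come from an ALM's audio and text encoders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def score_instance(audio_embs, text_embs):
    """Score one instance made of two audio-caption pairs.

    audio_embs[i] and text_embs[i] are assumed to be the embeddings of the
    i-th matching audio-caption pair (i = 0, 1).
    """
    # sim[i][j] = similarity between caption i and audio j
    sim = [[cosine(t, a) for a in audio_embs] for t in text_embs]

    # Assumed "text" criterion: for each audio, the correct caption wins.
    text_ok = sim[0][0] > sim[1][0] and sim[1][1] > sim[0][1]
    # Assumed "audio" criterion: for each caption, the correct audio wins.
    audio_ok = sim[0][0] > sim[0][1] and sim[1][1] > sim[1][0]

    return {"text": text_ok, "audio": audio_ok, "group": text_ok and audio_ok}

# Toy embeddings standing in for real encoder outputs.
audios = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.2, 0.8]]
print(score_instance(audios, captions))
```

Under this assumed scheme, a model passes an instance only when both captions and both audio samples are matched correctly, which is why two captions containing the same events in different compositions are hard to separate.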
