A fundamental characteristic of audio is its compositional nature. Audio-language models (ALMs) trained with a contrastive objective (e.g., CLAP) that learns a shared representation between the audio and language modalities have improved performance in many downstream applications, including zero-shot audio classification and audio retrieval. However, the ability of these models to effectively perform compositional reasoning remains largely unexplored and necessitates additional research. In this paper, we propose CompA, a collection of two expert-annotated benchmarks consisting mostly of real-world audio samples, to evaluate compositional reasoning in ALMs. Our proposed CompA-order evaluates how well an ALM understands the order or occurrence of acoustic events in audio, and CompA-attribute evaluates attribute binding of acoustic events. An instance from either benchmark consists of two audio-caption pairs, where both audio samples contain the same acoustic events but in different compositions. An ALM is evaluated on how well it matches the right audio to the right caption. Using this benchmark, we first show that current ALMs perform only marginally better than random chance, and thus struggle with compositional reasoning. Next, we propose CompA-CLAP, where we fine-tune CLAP using a novel learning method to improve its compositional reasoning abilities. To train CompA-CLAP, we first propose improvements to contrastive training with composition-aware hard negatives, allowing for more focused training. Next, we propose a novel modular contrastive loss that helps the model learn fine-grained compositional understanding and overcomes the acute scarcity of openly available compositional audio. CompA-CLAP significantly improves over all our baseline models on the CompA benchmark, indicating its superior compositional reasoning capabilities.
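The evaluation protocol described above can be sketched concretely. The snippet below is a minimal illustration, not the paper's released code: it assumes a Winoground-style scoring scheme for a single benchmark instance, where a 2x2 similarity matrix holds the model's audio-text similarities for the two audio-caption pairs, and the metric names (text, audio, group score) are our labels for the three natural matching criteria.

```python
import numpy as np

def compa_instance_scores(sim: np.ndarray) -> dict:
    """Score one two-pair benchmark instance.

    sim[i, j] is the model's similarity between audio i and caption j;
    caption i is the correct match for audio i. This Winoground-style
    scoring is an assumption for illustration, not the paper's exact code.
    """
    # Text score: each audio must rank its own caption above the other's.
    text = bool(sim[0, 0] > sim[0, 1] and sim[1, 1] > sim[1, 0])
    # Audio score: each caption must rank its own audio above the other's.
    audio = bool(sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1])
    # Group score: both directions must be correct simultaneously.
    return {"text": text, "audio": audio, "group": text and audio}

# A model that swaps event order in its representation would score the
# recomposed caption nearly as high as the correct one, failing the check.
scores = compa_instance_scores(np.array([[0.9, 0.2], [0.1, 0.8]]))
```

Under such a scheme, a model at random chance succeeds on the group criterion for only a small fraction of instances, which is why near-chance accuracy signals a failure of compositional reasoning rather than of general audio-text alignment.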