Recent literature uses language to build foundation models for audio. These Audio-Language Models (ALMs) are trained on a vast number of audio-text pairs and show remarkable performance in tasks including Text-to-Audio Retrieval, Captioning, and Question Answering. However, more complex open-ended tasks, like Interactive Question-Answering, require proficiency in logical reasoning -- a skill not yet benchmarked. We introduce the novel task of Audio Entailment to evaluate an ALM's deductive reasoning ability. This task assesses whether a text description (hypothesis) of audio content can be deduced from an audio recording (premise), with three possible conclusions, entailment, neutral, or contradiction, depending on the sufficiency of the evidence. We create two datasets for this task with audio recordings sourced from two audio captioning datasets -- AudioCaps and Clotho -- and hypotheses generated using Large Language Models (LLMs). We benchmark state-of-the-art ALMs and find deficiencies in logical reasoning with both zero-shot and linear probe evaluations. Finally, we propose "caption-before-reason", an intermediate step of captioning that improves the zero-shot and linear-probe performance of ALMs by an absolute 6% and 3%, respectively.
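As a rough illustration of the task framing and of the "caption-before-reason" idea, the following is a minimal sketch, not the authors' implementation. It treats Audio Entailment as three-way classification over (audio, hypothesis) pairs and inserts a captioning step before the reasoning step. All names here (`Label`, `Example`, `caption_before_reason`, `toy_captioner`, `toy_reasoner`) are hypothetical, and the toy captioner and reasoner are keyword-matching stand-ins for an ALM, used only so the example runs end to end.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Sequence


class Label(Enum):
    """The three possible conclusions in Audio Entailment."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class Example:
    audio_id: str    # stand-in for the audio recording (premise)
    hypothesis: str  # text description to be judged against the audio
    label: Label     # gold conclusion


# Hypothetical interfaces: a real ALM would consume a waveform, not an id.
Captioner = Callable[[str], str]
Reasoner = Callable[[str, str, str], Label]  # (audio_id, caption, hypothesis)


def caption_before_reason(audio_id: str, hypothesis: str,
                          captioner: Captioner, reasoner: Reasoner) -> Label:
    """Caption the audio first, then reason with the caption as extra context."""
    caption = captioner(audio_id)
    return reasoner(audio_id, caption, hypothesis)


def accuracy(examples: Sequence[Example],
             predict: Callable[[Example], Label]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("examples must be non-empty")
    return sum(predict(ex) == ex.label for ex in examples) / len(examples)


# --- Toy stand-ins (assumptions for illustration only, not real models) ---
TOY_CAPTIONS = {"clip_0": "a dog barks while rain falls"}
STOPWORDS = {"a", "the", "is", "no", "not"}


def toy_captioner(audio_id: str) -> str:
    return TOY_CAPTIONS[audio_id]


def toy_reasoner(audio_id: str, caption: str, hypothesis: str) -> Label:
    caption_words = set(caption.lower().split())
    hyp_words = hypothesis.lower().split()
    negated = "no" in hyp_words or "not" in hyp_words
    content = [w for w in hyp_words if w not in STOPWORDS]
    if not negated and all(w in caption_words for w in content):
        return Label.ENTAILMENT
    if negated and any(w in caption_words for w in content):
        return Label.CONTRADICTION
    return Label.NEUTRAL  # caption gives insufficient evidence either way


def predict(ex: Example) -> Label:
    return caption_before_reason(ex.audio_id, ex.hypothesis,
                                 toy_captioner, toy_reasoner)


examples = [
    Example("clip_0", "a dog barks", Label.ENTAILMENT),
    Example("clip_0", "a child laughs", Label.NEUTRAL),
    Example("clip_0", "no dog barks", Label.CONTRADICTION),
]
```

Calling `accuracy(examples, predict)` scores the pipeline on the three toy pairs. In a real evaluation the captioner and reasoner would be the ALM under test, and the examples would come from the AudioCaps- and Clotho-derived entailment datasets described above.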