Conventional audio classification has relied on predefined classes and cannot learn from free-form text. Recent methods learn joint audio-text embeddings from raw audio-text pairs in which the audio is described in natural language. Despite these advances, there has been little systematic exploration of how to train models to recognize sound events and sources in alternative scenarios, such as distinguishing fireworks from gunshots at outdoor events in otherwise similar situations. This study introduces causal reasoning and counterfactual analysis to the audio domain. We use counterfactual instances and incorporate them into our model across different aspects. Our model considers acoustic characteristics and sound source information from human-annotated reference texts. To validate its effectiveness, we pre-train on multiple audio captioning datasets and then evaluate on several common downstream tasks, demonstrating the merits of the proposed method as one of the first works to leverage counterfactual information in the audio domain. Specifically, top-1 accuracy on the open-ended language-based audio retrieval task increased by more than 43%.