Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent work on LASS has achieved promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events) but cannot separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks, including audio event separation, musical instrument separation, and speech enhancement. Using audio captions or text labels as queries, AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability, substantially outperforming previous audio-queried and language-queried sound separation models. To support reproducibility, we will release the source code, evaluation benchmark, and pre-trained model at: https://github.com/Audio-AGI/AudioSep.