Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, a limited set of audio event classes), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks, including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility, we will release the source code, evaluation benchmark, and pre-trained model at: https://github.com/Audio-AGI/AudioSep.
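The core LASS idea — conditioning a separation mask on an embedding of the text query, then applying that mask to the mixture — can be illustrated with a deliberately toy sketch. Everything below (the two-band mixture, the hand-built query embeddings, and `predict_mask`) is hypothetical and for illustration only; it is not AudioSep's actual architecture or API.

```python
import numpy as np

# Toy "spectrogram" mixture: two sources occupying disjoint frequency bands.
n_freq, n_time = 8, 4
source_a = np.zeros((n_freq, n_time)); source_a[:4] = 1.0   # low-band source
source_b = np.zeros((n_freq, n_time)); source_b[4:] = 1.0   # high-band source
mixture = source_a + source_b

# Stand-in text-query embeddings (a real system would use a learned text encoder).
queries = {
    "low hum":   np.array([1.0, 0.0]),
    "high hiss": np.array([0.0, 1.0]),
}

def predict_mask(query_emb: np.ndarray, mixture: np.ndarray) -> np.ndarray:
    """Toy conditioning network: the query embedding gates each frequency band.

    In a real LASS model a neural network would predict this mask from both
    the mixture and the query embedding; here the mapping is hand-wired.
    """
    low_gate, high_gate = query_emb
    mask = np.zeros_like(mixture)
    mask[:4] = low_gate
    mask[4:] = high_gate
    return mask

def separate(query: str, mixture: np.ndarray) -> np.ndarray:
    """Separate the sound described by `query` via query-conditioned masking."""
    mask = predict_mask(queries[query], mixture)
    return mask * mixture

est_low = separate("low hum", mixture)   # recovers source_a in this toy setup
```

The point of the sketch is the interface: the same mixture and the same masking machinery yield different separated outputs depending solely on the natural language query, which is what lets a single model cover open-domain sound classes.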