Audio separation in real-world scenarios, where mixtures contain a variable number of sources, is challenging due to the limitations of existing models, such as over-separation, under-separation, and dependence on predefined training sources. We propose OpenSep, a novel framework that leverages large language models (LLMs) for automated audio separation, eliminating the need for manual intervention and overcoming source limitations. OpenSep uses textual inversion to generate captions from audio mixtures with off-the-shelf audio captioning models, effectively parsing the sound sources present. It then employs few-shot LLM prompting to extract detailed audio properties of each parsed source, facilitating separation in unseen mixtures. Additionally, we introduce a multi-level extension of the mix-and-separate training framework that enhances modality alignment by separating single-source sounds and mixtures simultaneously. Extensive experiments demonstrate OpenSep's superiority in precisely separating new, unseen, and variable sources in challenging mixtures, outperforming SOTA baseline methods. Code is released at https://github.com/tanvir-utexas/OpenSep.git
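The described pipeline (caption the mixture, parse out candidate sources, then query an LLM with few-shot prompts for per-source audio properties) can be sketched as follows. This is a minimal illustrative sketch, not the released OpenSep code: the captioner and parser are stubbed with toy functions, and all function names, the example caption, and the prompt format are assumptions.

```python
# Illustrative sketch of an OpenSep-style pipeline. Everything here is
# hypothetical: a real system would call a pretrained audio captioning
# model and an LLM instead of these stubs.

def caption_mixture(audio_path):
    """Stub for an off-the-shelf audio captioning model."""
    # A real implementation would run a captioner on the waveform.
    return "a dog barking while a violin plays"

def parse_sources(caption):
    """Toy heuristic that splits a caption into candidate sources."""
    # OpenSep performs this parsing with LLM prompting; splitting on a
    # connective word is only a stand-in for demonstration.
    return [part.strip() for part in caption.split("while")]

def build_fewshot_prompt(source):
    """Compose a few-shot prompt asking an LLM for audio properties."""
    examples = (
        "Source: acoustic guitar\n"
        "Properties: plucked strings, mid-range, harmonic\n\n"
    )
    return examples + f"Source: {source}\nProperties:"

caption = caption_mixture("mixture.wav")
sources = parse_sources(caption)
prompts = [build_fewshot_prompt(s) for s in sources]
print(sources)
```

Each generated prompt would then be sent to an LLM, and the returned property descriptions would condition the separation model on each parsed source.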