Universal source separation aims to separate the audio sources of an arbitrary mix, removing the constraint of operating in a specific domain such as speech or music. Yet its potential remains limited: most existing work focuses on mixes dominated by sound events, and small training datasets constrain supervised learning. Here, we study a single general audio source separation (GASS) model trained in a supervised fashion on a large-scale dataset to separate speech, music, and sound events. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows their generalization abilities. Yet, it remains challenging for GASS models to generalize to out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset, and the fine-tuned models consistently outperform those trained without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.