Universal source separation (USS) is a fundamental research task in computational auditory scene analysis that aims to separate mono recordings into individual source tracks. Three challenges in audio source separation remain open. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources, and there is little research on building a unified system that can separate arbitrary sources with a single model. Second, most previous systems require clean source data to train a separator, yet clean source data are scarce. Third, there is a lack of USS systems that can automatically detect and separate active sound classes hierarchically. To use large-scale weakly labeled or unlabeled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labeled data as a query net; and 2) a conditional source separation model that takes the query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies, and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. Trained solely on the weakly labeled AudioSet, our USS system succeeds on a wide variety of separation tasks, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over the 527 sound classes of AudioSet, 10.57 dB on the DCASE 2018 Task 2 dataset, 8.12 dB on the MUSDB18 dataset, and 7.28 dB on the Slakh2100 dataset, as well as an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss.
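For readers unfamiliar with the SDRi metric reported above, the following is a minimal sketch of how a signal-to-distortion ratio improvement can be computed for a single source. It assumes the plain energy-ratio definition of SDR (reference energy over residual energy, in dB); the exact SDR variant and evaluation code used by the USS system may differ, and the function names `sdr_db` and `sdr_improvement_db` are hypothetical, not taken from the released repository.

```python
import math


def sdr_db(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB (assumed plain energy-ratio form)."""
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    # eps guards against division by zero and log of zero.
    return 10.0 * math.log10((signal_power + eps) / (error_power + eps))


def sdr_improvement_db(reference, mixture, estimate):
    """SDRi: SDR of the separated estimate minus SDR of the unprocessed mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


# Toy example with synthetic tones (not real audio): a 440 Hz target
# mixed with a 1 kHz interferer, sampled at 16 kHz.
n = 1600
source = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(n)]
interference = [0.5 * math.sin(2 * math.pi * 1000 * t / 16000) for t in range(n)]
mixture = [s + i for s, i in zip(source, interference)]

# A hypothetical separator output that keeps 10% of the interferer's amplitude.
estimate = [s + 0.1 * i for s, i in zip(source, interference)]

sdri = sdr_improvement_db(source, mixture, estimate)
print(round(sdri, 2))  # 20.0
```

In this toy case the residual amplitude drops by a factor of 10, so its energy drops by a factor of 100 and the SDRi is 20 dB. An SDRi of 0 dB would mean the output is no closer to the target than the input mixture was.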