Several attempts have been made to handle multiple source separation tasks, such as speech enhancement, speech separation, sound event separation, music source separation (MSS), and cinematic audio source separation (CASS), with a single model. These models are trained on large-scale data including speech, instruments, and sound events, and can often separate a wide range of sources successfully. However, it remains challenging for such models to cover all separation tasks because some tasks contradict each other (e.g., musical instruments are separated in MSS but must be grouped in CASS). To overcome this issue and support all the major separation tasks, we propose a task-aware unified source separation (TUSS) model. The model uses a variable number of learnable prompts to specify which sources to separate and changes its behavior depending on the given prompts, enabling it to handle all the major separation tasks, including contradictory ones. Experimental results demonstrate that the proposed TUSS model successfully handles the five major separation tasks listed above. We also provide audio examples, including both synthetic mixtures and real recordings, to demonstrate how flexibly the TUSS model changes its behavior at inference depending on the prompts.