Universal sound separation (USS) is the task of separating arbitrary sounds from an audio mixture. Existing USS systems can separate arbitrary sources given a few examples of the target sources as queries. However, separating arbitrary sounds with a single system is challenging, and robustness is not always guaranteed. In this work, we propose audio prompt tuning (APT), a simple yet effective approach to enhance existing USS systems. Specifically, APT improves the separation performance on specific sources by training a small number of prompt parameters with limited audio samples, while preserving the generalization of the USS model by keeping its parameters frozen. We evaluate the proposed method on the MUSDB18 and ESC-50 datasets. Compared with the baseline model, APT improves the signal-to-distortion ratio (SDR) by 0.67 dB and 2.06 dB, respectively, when using the full training sets of the two datasets. Moreover, APT with only 5 audio samples outperforms the baseline systems that use the full training data on the ESC-50 dataset, indicating the great potential of few-shot APT.
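The core idea of the abstract, tuning only a small prompt while the separation model stays frozen, can be illustrated with a toy sketch. This is not the APT architecture: the abstract does not describe it, so the sigmoid-mask "separator", the names `separate`, `tune_prompt`, `frozen_w`, and all sizes and hyperparameters below are assumptions chosen for illustration. The signal-to-distortion ratio is computed with the basic definition $10 \log_{10}(\lVert s \rVert^2 / \lVert s - \hat{s} \rVert^2)$, which may differ from the evaluation variant used in the experiments.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Basic signal-to-distortion ratio in dB (assumed definition)."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))


def separate(mixtures, prompt, frozen_w):
    """Toy frozen separator: a per-bin sigmoid mask conditioned on the prompt."""
    mask = 1.0 / (1.0 + np.exp(-(frozen_w @ prompt)))  # shape (n_bins,)
    return mask * mixtures, mask


def tune_prompt(mixtures, targets, prompt, frozen_w, lr=0.1, steps=1000):
    """Gradient descent on the prompt only; frozen_w is read but never updated."""
    prompt = prompt.copy()
    for _ in range(steps):
        estimates, mask = separate(mixtures, prompt, frozen_w)
        # Loss: squared error summed over bins, averaged over clips.
        grad_mask = 2.0 * np.mean((estimates - targets) * mixtures, axis=0)
        grad_logits = grad_mask * mask * (1.0 - mask)  # chain rule through the sigmoid
        prompt -= lr * (frozen_w.T @ grad_logits)
    return prompt


rng = np.random.default_rng(0)
n_clips, n_bins, prompt_dim = 5, 8, 8  # hypothetical sizes; 5 clips mimics a few-shot setup

# Synthetic data: the target occupies the first 4 bins, the interference the last 4.
targets = np.zeros((n_clips, n_bins))
targets[:, :4] = rng.standard_normal((n_clips, 4))
interference = np.zeros((n_clips, n_bins))
interference[:, 4:] = rng.standard_normal((n_clips, 4))
mixtures = targets + interference

# Stand-in for the pretrained model parameters, kept fixed during tuning.
frozen_w = rng.standard_normal((n_bins, prompt_dim)) / np.sqrt(prompt_dim)
w_snapshot = frozen_w.copy()

initial_prompt = np.zeros(prompt_dim)  # yields a mask of 0.5 in every bin
tuned_prompt = tune_prompt(mixtures, targets, initial_prompt, frozen_w)

sdr_before = sdr_db(targets, separate(mixtures, initial_prompt, frozen_w)[0])
sdr_after = sdr_db(targets, separate(mixtures, tuned_prompt, frozen_w)[0])
print(f"SDR before tuning: {sdr_before:.2f} dB, after tuning: {sdr_after:.2f} dB")
```

Only the `prompt_dim` entries of the prompt are trained here, so the frozen weights can be shared unchanged across other target sources, each with its own tuned prompt. The SDR values are measured on the same synthetic clips used for tuning and say nothing about real audio.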