Universal sound separation (USS) is the task of separating mixtures of arbitrary sound sources. Typically, universal separation models are trained from scratch in a supervised manner using labeled data. Self-supervised learning (SSL) is an emerging deep learning approach that leverages unlabeled data to obtain task-agnostic representations, which can benefit many downstream tasks. In this paper, we propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system to enhance its separation performance. We employ two strategies to utilize the SSL embeddings: freezing or updating the parameters of A-MAE during fine-tuning. The SSL embeddings are concatenated with short-time Fourier transform (STFT) features to serve as the input to the separation model. We evaluate our methods on the AudioSet dataset, and the experimental results indicate that the proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
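As a rough illustration of the feature fusion described above, the following PyTorch sketch concatenates embeddings from a frozen pre-trained encoder with an STFT magnitude spectrogram along the feature axis before they are passed to a separation backbone. The `ssl_encoder` interface, the alignment of SSL frames to STFT frames by interpolation, and the hyperparameters (`n_fft=1024`, `hop=320`) are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn


class FusionFrontEnd(nn.Module):
    """Illustrative front end: concatenates frozen SSL embeddings with an
    STFT magnitude spectrogram along the feature axis. `ssl_encoder`
    stands in for a pre-trained A-MAE; its interface (waveform in,
    (batch, frames, embed_dim) out) is an assumption, not the paper's code."""

    def __init__(self, ssl_encoder, n_fft=1024, hop=320, freeze_ssl=True):
        super().__init__()
        self.ssl_encoder = ssl_encoder
        self.n_fft, self.hop = n_fft, hop
        if freeze_ssl:  # strategy 1: keep the A-MAE parameters fixed
            for p in self.ssl_encoder.parameters():
                p.requires_grad = False

    def forward(self, mixture):  # mixture: (batch, samples)
        # Magnitude STFT, shape (batch, n_fft // 2 + 1, frames)
        spec = torch.stft(
            mixture, self.n_fft, self.hop,
            window=torch.hann_window(self.n_fft, device=mixture.device),
            return_complex=True,
        ).abs()
        # SSL embeddings, assumed shape (batch, frames', embed_dim)
        emb = self.ssl_encoder(mixture)
        # Align the SSL time axis to the STFT frame axis by interpolation
        emb = nn.functional.interpolate(
            emb.transpose(1, 2), size=spec.shape[-1], mode="linear")
        # Concatenate along the feature axis; the result feeds the
        # separation backbone (a ResUNet in the paper)
        return torch.cat([spec, emb], dim=1)
```

Under the second strategy the construction is the same but `freeze_ssl=False`, so the A-MAE parameters are updated jointly with the separation model during fine-tuning.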