Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study addresses the separation of vocal components from musical spectrograms. We apply the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed time-frequency spectrograms, using the benchmark MUSDB18 music-separation dataset. We then train a U-Net neural network to segment the spectrogram image, aiming to delineate and extract the singing-voice components accurately. Our U-Net-based models achieved noteworthy results in audio source separation. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating that the quality of the original signal is well preserved during separation. The same configuration also recorded a Source-to-Interference Ratio (SIR) of 25.2 dB and a Source-to-Artifact Ratio (SAR) of 7.2 dB, significantly outperforming the other configurations, particularly those using quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material are available at the project's GitHub repository: https://github.com/mbrotos/SoundSeg
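The preprocessing described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the synthetic sine tone stands in for a MUSDB18 track, and the STFT parameters (`nperseg=1024`, `noverlap=768`) and the small epsilon guarding constant bins are assumptions for demonstration.

```python
import numpy as np
from scipy.signal import stft

# Synthetic mono audio: 1 s of a 440 Hz tone at 22.05 kHz
# (a stand-in for an actual MUSDB18 mixture track).
sr = 22050
t = np.arange(sr) / sr
audio = np.sin(2 * np.pi * 440 * t).astype(np.float32)

# STFT: audio waveform -> complex time-frequency spectrogram.
# Window size and overlap are illustrative choices, not the paper's.
freqs, frames, Z = stft(audio, fs=sr, nperseg=1024, noverlap=768)
mag = np.abs(Z)  # magnitude spectrogram, shape (freq_bins, time_frames)

# Frequency-axis Min/Max normalization: rescale each frequency bin to
# [0, 1] independently across time; the epsilon avoids division by zero
# for bins that are constant over the whole clip.
mn = mag.min(axis=1, keepdims=True)
mx = mag.max(axis=1, keepdims=True)
norm = (mag - mn) / (mx - mn + 1e-8)

print(norm.shape)  # (freq_bins, time_frames); values lie in [0, 1]
```

A U-Net would then be trained on such normalized magnitude spectrograms (e.g. with an MAE loss against the isolated-vocal spectrogram), and the predicted vocal magnitude recombined with the mixture phase via the inverse STFT.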