Music source separation aims to separate polyphonic music into different types of sources. Most existing methods improve separation quality by enlarging the model, which makes them unsuitable for deployment on edge devices. Moreover, these methods may produce low-quality output when the input duration is short, making them impractical for real-time applications. This paper therefore enhances a lightweight model, MMDenseNet, to strike a balance between separation quality and latency in real-time applications. Several directions of improvement are explored or proposed, including the complex ideal ratio mask, self-attention, a band-merge-split method, and feature look-back. Source-to-distortion ratio, real-time factor, and optimal latency are used to evaluate performance. To align with our application requirements, the evaluation focuses on the separation performance of the accompaniment part. Experimental results show that our improvements achieve a low real-time factor and low optimal latency while maintaining acceptable separation quality.
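As background for one of the directions named above, the complex ideal ratio mask (cIRM) is the complex-valued mask that, when multiplied element-wise with the mixture spectrogram, recovers the target source exactly. A minimal NumPy sketch (the function name and `eps` stabilizer are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def complex_ideal_ratio_mask(source_stft: np.ndarray,
                             mixture_stft: np.ndarray,
                             eps: float = 1e-8) -> np.ndarray:
    """Compute the cIRM M so that source ≈ M * mixture (complex multiply).

    Derived from M = S / Y expanded into real and imaginary parts;
    eps guards against division by zero in silent bins.
    """
    denom = mixture_stft.real ** 2 + mixture_stft.imag ** 2 + eps
    real = (source_stft.real * mixture_stft.real +
            source_stft.imag * mixture_stft.imag) / denom
    imag = (source_stft.imag * mixture_stft.real -
            source_stft.real * mixture_stft.imag) / denom
    return real + 1j * imag
```

Unlike a magnitude-only mask, the cIRM also corrects phase, which is why it is attractive as a training target for separation models.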