We study the problem of stereo singing voice cancellation, a subtask of music source separation, whose goal is to estimate an instrumental background from a stereo mix. We explore how to achieve performance similar to large state-of-the-art source separation networks starting from a small, efficient model for real-time speech separation. Such a model is useful when memory and compute are limited and singing voice processing has to run with limited look-ahead. In practice, this is realised by adapting an existing mono model to handle stereo input. Improvements in quality are obtained by tuning model parameters and expanding the training set. Moreover, we highlight the benefits a stereo model brings by introducing a new metric which detects attenuation inconsistencies between channels. Our approach is evaluated using objective offline metrics and a large-scale MUSHRA trial, confirming the effectiveness of our techniques in stringent listening tests.