With the proliferation of video platforms on the internet, recording musical performances with mobile devices has become commonplace. However, these recordings often suffer from degradations such as noise and reverberation, which harm the listening experience. Consequently, demand has grown for music audio enhancement (hereafter, music enhancement), which transforms degraded recordings into clean, high-quality music. To address this need, we propose a music enhancement system based on the Conformer architecture, which has demonstrated outstanding performance in speech enhancement tasks. We explore the Conformer's attention mechanisms and compare their performance to identify the best configuration for music enhancement. Experimental results show that the proposed model achieves state-of-the-art performance on single-stem music enhancement. Furthermore, our system can perform general music enhancement on multi-track mixtures, which has not been examined in previous work.