Phase Repair for Time-Domain Convolutional Neural Networks in Music Super-Resolution

Audio super-resolution (SR) is an important topic because low-resolution recordings are ubiquitous in daily life. In this paper, we focus on the music SR task, which is challenging because of the wide frequency response and dynamic range of music. Many models operate in the time domain so that they process the magnitude and phase of audio signals jointly. However, prior work shows that approaches based on Time-Domain Convolutional Neural Networks (TD-CNNs) tend to produce annoying artifacts in their waveform outputs, and the cause of these artifacts has not yet been identified. To the best of our knowledge, this work is the first to demonstrate, via a subjective experiment, that the artifacts in TD-CNNs are caused by phase distortion. We further propose Time-Domain Phase Repair (TD-PR), which uses a neural vocoder pre-trained on wide-band data to repair the phase components of the waveform outputs of TD-CNNs. Although the vocoder and the TD-CNNs are trained independently, TD-PR obtains a better mean opinion score, significantly improving the perceptual quality of the TD-CNN baselines. Because TD-PR repairs only the phase components of the waveforms, the improved perceptual quality in turn indicates that phase distortion is the cause of the annoying artifacts of TD-CNNs. Moreover, a single pre-trained vocoder can be applied directly to arbitrary TD-CNNs without additional adaptation. We therefore apply TD-PR to three TD-CNNs that differ in architecture and parameter count, and observe consistent improvements on all three baselines. Audio samples are available on the demo page.
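The idea of repairing only the phase can be illustrated with a minimal sketch: keep the short-time magnitude spectrum of a waveform and re-estimate its phase. The sketch below is not the proposed TD-PR. It swaps in the classical Griffin-Lim iteration for the pre-trained neural vocoder, purely to show what "keep the magnitude, regenerate the phase" means. The function names (`stft`, `istft`, `regenerate_phase`) and the parameter values (`n_fft`, `hop`, `n_iter`) are assumptions made for illustration and do not come from the paper.

```python
import numpy as np


def stft(signal, n_fft, hop):
    """Hann-windowed STFT; returns an array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(spec, n_fft, hop, length):
    """Least-squares inverse of stft() by windowed overlap-add."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=1) * window
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * hop:i * hop + n_fft] += frame
        norm[i * hop:i * hop + n_fft] += window ** 2
    # Guard against division by (near) zero at the padded edges.
    return signal / np.maximum(norm, 1e-8)


def regenerate_phase(waveform, n_fft=1024, hop=256, n_iter=30, seed=0):
    """Keep the STFT magnitude of `waveform` and re-estimate its phase.

    Assumption: Griffin-Lim stands in for the neural vocoder of TD-PR.
    """
    waveform = np.asarray(waveform, dtype=float)
    # Pad both ends, then pad the tail so the frames tile the signal exactly.
    padded = np.pad(waveform, n_fft)
    extra = (-(len(padded) - n_fft)) % hop
    padded = np.pad(padded, (0, extra))

    # The magnitude is taken from the input and never modified.
    magnitude = np.abs(stft(padded, n_fft, hop))

    # Start from random phase, i.e. discard the (possibly distorted) phase.
    rng = np.random.default_rng(seed)
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))

    for _ in range(n_iter):
        # Project onto valid waveforms, then keep only the resulting phase.
        estimate = istft(magnitude * phase, n_fft, hop, len(padded))
        phase = np.exp(1j * np.angle(stft(estimate, n_fft, hop)))

    repaired = istft(magnitude * phase, n_fft, hop, len(padded))
    return repaired[n_fft:n_fft + len(waveform)]
```

In this sketch the input would play the role of a TD-CNN output, for example `repaired = regenerate_phase(td_cnn_output)`, where `td_cnn_output` is a hypothetical one-dimensional array of samples. A learned vocoder trained on wide-band music would be expected to produce far more natural phase than this iterative stand-in.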

