We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain. AERO is based on an encoder-decoder architecture with U-Net-like skip connections. We optimize the model using both time- and frequency-domain loss functions. Specifically, we consider a set of reconstruction losses together with perceptual ones in the form of adversarial and feature discriminator loss functions. To better handle phase information, the proposed method operates over the complex-valued spectrogram using two separate channels. Unlike prior work, which mainly considers low- and high-frequency concatenation for audio super-resolution, the proposed method directly predicts the full frequency range. We demonstrate high performance across a wide range of sample rates for both speech and music. AERO outperforms the evaluated baselines in terms of Log-Spectral Distance, ViSQOL, and the subjective MUSHRA test. Audio samples and code are available at https://pages.cs.huji.ac.il/adiyoss-lab/aero
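To make the two-channel treatment of the complex-valued spectrogram concrete, the following is a minimal NumPy sketch and not the AERO implementation. It assumes that the two channels hold the real and imaginary parts of the STFT, and it uses a commonly cited form of Log-Spectral Distance (per-frame root-mean-square difference of log power spectra, averaged over frames). The function names, the FFT size, the hop length, and the test signal are all illustrative assumptions.

```python
import numpy as np


def stft(x, n_fft=512, hop=128):
    """Hann-windowed STFT; returns a complex array of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)


def to_two_channels(spec):
    """Stack real and imaginary parts as two real channels: (2, frames, bins)."""
    return np.stack([spec.real, spec.imag], axis=0)


def from_two_channels(chans):
    """Inverse of to_two_channels: rebuild the complex spectrogram."""
    return chans[0] + 1j * chans[1]


def log_spectral_distance(ref, est, eps=1e-10):
    """Assumed LSD form: RMS over frequency of the log-power difference, averaged over frames."""
    ref_log = np.log10(np.abs(ref) ** 2 + eps)
    est_log = np.log10(np.abs(est) ** 2 + eps)
    per_frame = np.sqrt(np.mean((ref_log - est_log) ** 2, axis=-1))
    return float(np.mean(per_frame))


# Illustrative signal: one second at 16 kHz with a low and a high tone.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 6000 * t)

spec = stft(signal)
chans = to_two_channels(spec)       # real-valued input a network could consume
restored = from_two_channels(chans)  # lossless round trip back to complex

# Crude stand-in for a band-limited input: discard everything above 4 kHz.
spec_low = spec.copy()
spec_low[:, 128:] = 0
lsd_low = log_spectral_distance(spec, spec_low)
```

Because both channels are kept, phase is carried implicitly and the complex spectrogram is recovered exactly from the two-channel array. The band-limited copy, which lacks the 6 kHz component, yields a strictly positive distance to the full-band reference, while a spectrogram compared with itself yields zero.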