This paper presents CQT-Diff, a data-driven generative audio model that, once trained, can be used to solve a variety of audio inverse problems in a problem-agnostic setting. CQT-Diff is a neural diffusion model whose architecture is carefully constructed to exploit pitch-equivariant symmetries in music. This is achieved by preconditioning the model with an invertible Constant-Q Transform (CQT), whose logarithmically spaced frequency axis represents pitch equivariance as translation equivariance. The proposed method is evaluated with objective and subjective metrics on three tasks: audio bandwidth extension, inpainting, and declipping. The results show that CQT-Diff outperforms the compared baselines and ablations in audio bandwidth extension and, without retraining, delivers competitive performance against modern baselines in audio inpainting and declipping. This work represents the first diffusion-based general framework for solving inverse problems in audio processing.
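The claim that a logarithmically spaced frequency axis turns pitch equivariance into translation equivariance can be checked numerically. The following is a minimal sketch, not the CQT-Diff implementation: it does not compute a CQT, it only maps the partials of a harmonic tone onto a log2-spaced bin axis. The helper names `log_freq_bins` and `harmonic_partials`, the reference frequency `f_min`, and the resolution of 12 bins per octave are illustrative assumptions that do not come from the paper.

```python
import numpy as np

def log_freq_bins(freqs_hz, f_min=32.70, bins_per_octave=12):
    """Map frequencies in Hz to fractional positions on a log2-spaced bin axis.

    f_min and bins_per_octave are illustrative assumptions, not values from the paper.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    return bins_per_octave * np.log2(freqs_hz / f_min)

def harmonic_partials(f0_hz, n_partials=5):
    """Frequencies of the first n_partials harmonics of a tone with fundamental f0_hz."""
    return f0_hz * np.arange(1, n_partials + 1)

f0 = 220.0                       # fundamental of the original tone (Hz)
semitones = 3                    # transposition interval
ratio = 2.0 ** (semitones / 12)  # equal-tempered frequency ratio

original = harmonic_partials(f0)
transposed = harmonic_partials(f0 * ratio)

# On a linear frequency axis, each partial moves by a different amount (a dilation).
linear_shift = transposed - original

# On the log-spaced axis, every partial moves by the same number of bins (a translation).
log_shift = log_freq_bins(transposed) - log_freq_bins(original)

print("linear-axis shifts (Hz):", linear_shift)
print("log-axis shifts (bins): ", log_shift)
```

With 12 bins per octave, a transposition by `semitones` moves every partial by the same `semitones` bins, whereas the shift in Hz grows with the harmonic number. A layer that is translation-equivariant along the log-frequency axis, such as a convolution, would therefore treat a transposed harmonic pattern as a shifted copy of the original.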