Audio inpainting aims to reconstruct missing segments in corrupted recordings. Most existing methods produce plausible reconstructions when the gaps are short, but struggle to reconstruct gaps longer than about 100 ms. This paper explores recent advances in deep learning, particularly diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model, which can be conditioned in a zero-shot fashion for audio inpainting and is able to regenerate gaps of any size. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated with objective and subjective metrics on the task of reconstructing short to mid-sized gaps of up to 300 ms. The results of a formal listening test show that the proposed method performs comparably to the baselines for short gaps, such as 50 ms, while retaining good audio quality and outperforming the baselines for wider gaps of up to 300 ms. The method presented in this paper can be applied to restoring sound recordings that suffer from severe local disturbances or dropouts that must be reconstructed.
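As an illustration of the problem setup, the following minimal sketch builds a gap mask for a 300 ms dropout and applies a generic data-consistency step that keeps the observed samples and fills only the gap. It is not the conditioning scheme of the proposed method, which the abstract does not specify. The function names `gap_mask` and `data_consistency`, the 44.1 kHz sample rate, the gap position, and the random stand-in for a generative model's output are all assumptions made for illustration.

```python
import numpy as np


def gap_mask(num_samples, gap_start_s, gap_len_s, sample_rate):
    """Return a mask that is 1.0 for known samples and 0.0 inside the gap."""
    mask = np.ones(num_samples, dtype=np.float64)
    start = int(round(gap_start_s * sample_rate))
    stop = min(num_samples, start + int(round(gap_len_s * sample_rate)))
    mask[start:stop] = 0.0
    return mask


def data_consistency(estimate, observed, mask):
    """Keep observed samples outside the gap and the estimate inside it."""
    return mask * observed + (1.0 - mask) * estimate


sample_rate = 44100  # assumed; the abstract does not state a sample rate
t = np.arange(sample_rate) / sample_rate
clean = np.sin(2 * np.pi * 440.0 * t)  # one second of a 440 Hz tone

# Simulate a 300 ms dropout starting at 0.5 s (position chosen arbitrarily).
mask = gap_mask(clean.size, gap_start_s=0.5, gap_len_s=0.3, sample_rate=sample_rate)
observed = mask * clean

# Hypothetical stand-in for a generative model's estimate of the full signal.
rng = np.random.default_rng(0)
estimate = rng.standard_normal(clean.size)

restored = data_consistency(estimate, observed, mask)
gap_samples = int((mask == 0).sum())
```

In a diffusion-based setting, a step of this kind would typically be applied to the model's intermediate estimates during sampling, so that only the samples inside the gap are left to the generative model.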