Audio inpainting aims to reconstruct missing segments in corrupted recordings. Previous methods produce plausible reconstructions when the gap is shorter than about 100\;ms, but quality degrades for longer gaps. This paper explores recent advances in deep learning, particularly diffusion models, for the task of audio inpainting. The proposed method uses an unconditionally trained generative model that can be conditioned in a zero-shot fashion for audio inpainting, which gives it the flexibility to regenerate gaps of arbitrary length. An improved deep neural network architecture based on the constant-Q transform, which allows the model to exploit pitch-equivariant symmetries in audio, is also presented. The performance of the proposed algorithm is evaluated with objective and subjective metrics on the task of reconstructing short to mid-sized gaps. The results of a formal listening test show that the proposed method performs comparably to the state of the art for short gaps, while retaining good audio quality and outperforming the baselines for the longest gap lengths tested, 150\;ms and 200\;ms. This work helps improve the restoration of sound recordings with fairly long local disturbances or dropouts that must be reconstructed.
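To make the idea of zero-shot conditioning concrete, the following is a minimal sketch of one common scheme: at every sampling step, the known samples of the denoised estimate are overwritten with the observed audio, so the unconditional model only has to fill the gap. The abstract does not specify the conditioning rule, so this should not be read as the paper's algorithm. The names `toy_denoiser` and `inpaint`, the 16 kHz sample rate, and the noise schedule are all illustrative assumptions, and the Gaussian-prior denoiser merely stands in for a trained network.

```python
import numpy as np


def toy_denoiser(x, sigma, prior_std=1.0):
    # Hypothetical stand-in for a trained unconditional model: the exact
    # MMSE denoiser under an i.i.d. Gaussian prior N(0, prior_std^2).
    return x * prior_std**2 / (prior_std**2 + sigma**2)


def inpaint(observed, mask, denoiser, sigmas, rng):
    """Fill the gap (mask == 0) while keeping known samples (mask == 1)."""
    # Start from pure noise at the highest noise level.
    x = sigmas[0] * rng.standard_normal(observed.shape)
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        # Unconditional estimate of the clean signal.
        x0_hat = denoiser(x, sigma)
        # Data consistency: overwrite known samples with the observation.
        x0_hat = mask * observed + (1.0 - mask) * x0_hat
        # Deterministic step toward the estimate at the next noise level.
        x = x0_hat + (sigma_next / sigma) * (x - x0_hat)
    return x


rng = np.random.default_rng(0)
sample_rate = 16000                      # assumed, not taken from the paper
t = np.arange(sample_rate) / sample_rate
observed = 0.5 * np.sin(2 * np.pi * 220.0 * t)

# A 200 ms gap in the middle of a one-second signal.
gap_len = int(0.200 * sample_rate)
gap_start = (len(observed) - gap_len) // 2
mask = np.ones_like(observed)
mask[gap_start:gap_start + gap_len] = 0.0

# Decreasing noise levels, ending at exactly zero (assumed schedule).
sigmas = np.append(np.geomspace(10.0, 0.01, 30), 0.0)

restored = inpaint(observed * mask, mask, toy_denoiser, sigmas, rng)
```

Because the mask enters only at sampling time, the same unconditional model can in principle be reused for any gap length or position, which is the flexibility the abstract refers to. With the toy prior the filled gap carries no musical content; a trained model would be needed for a plausible reconstruction.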