Audio inpainting seeks to restore missing segments in degraded recordings. Previous diffusion-based methods exhibit impaired performance when the missing region is large. We introduce the first approach that applies discrete diffusion over tokenized music representations from a pre-trained audio tokenizer, enabling stable and semantically coherent restoration of long gaps. Our method further incorporates two training strategies: a derivative-based regularization loss that enforces smooth temporal dynamics, and a span-based absorbing transition that provides structured corruption during diffusion. Experiments on the MusicNet and MAESTRO datasets with gaps up to 750 ms show that our approach consistently outperforms strong baselines for gaps of 150 ms and above. This work advances musical audio restoration and opens new directions for discrete diffusion model training. Visit our project page for examples and code.
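The two training strategies named in the abstract can be illustrated with a minimal sketch. The snippet below is a hypothetical simplification, not the paper's implementation: `MASK_ID`, the span-length distribution, and both function signatures are assumptions. It shows (a) a span-based absorbing corruption that masks contiguous token runs, so training-time corruption resembles the contiguous gaps seen at inference, and (b) a first-difference penalty as a stand-in for a derivative-based smoothness regularizer.

```python
import random

MASK_ID = -1  # hypothetical id for the absorbing (mask) state


def span_absorbing_corrupt(tokens, mask_rate, mean_span=5, rng=random.Random(0)):
    """Absorb contiguous spans of tokens into MASK_ID.

    Unlike the standard absorbing transition, which masks tokens
    independently, whole spans are masked until roughly `mask_rate`
    of the sequence is corrupted (a sketch of structured corruption).
    """
    out = list(tokens)
    n = len(tokens)
    target = int(mask_rate * n)  # tokens to absorb at this noise level
    masked = 0
    while masked < target:
        span = max(1, int(rng.expovariate(1.0 / mean_span)))  # random span length
        start = rng.randrange(n)
        for i in range(start, min(start + span, n)):
            if out[i] != MASK_ID:
                out[i] = MASK_ID
                masked += 1
            if masked >= target:
                break
    return out


def derivative_regularizer(latents):
    """Mean squared first difference: penalizes abrupt frame-to-frame jumps,
    encouraging smooth temporal dynamics in the predicted sequence."""
    if len(latents) < 2:
        return 0.0
    return sum((b - a) ** 2 for a, b in zip(latents, latents[1:])) / (len(latents) - 1)
```

For example, corrupting 100 tokens at `mask_rate=0.5` absorbs exactly 50 of them in contiguous runs, and a constant latent trajectory incurs zero smoothness penalty.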