In recent studies, diffusion models have shown promise as priors for solving audio inverse problems. By manipulating the diffusion process, these models allow us to sample from the posterior distribution of a target signal given an observed signal. However, when separating audio sources of the same type, such as duet singing voices, the prior learned by the diffusion process may not be sufficient to maintain the consistency of the source identity in the separated audio; for example, the separated tracks may occasionally swap from one singer to the other. Tackling this problem would make it possible to separate sources in a choir, or a mixture of multiple instruments with similar timbre, without acquiring large amounts of paired data. In this paper, we examine this problem in the context of duet singing voice separation, and propose a method that enforces the coherence of singer identity by splitting the mixture into overlapping segments and performing posterior sampling in an auto-regressive manner, conditioning each segment on the previous one. We evaluate the proposed method on the MedleyVox dataset and show that it outperforms the naive posterior sampling baseline. Our source code and pre-trained model are publicly available at https://github.com/iamycy/duet-svs-diffusion.