Recent studies have shown that diffusion models are promising priors for solving audio inverse problems. By manipulating the diffusion process, these models allow us to sample from the posterior distribution of a target signal given an observed signal. However, when separating audio sources of the same type, such as duet singing voices, the prior learned by the diffusion model may not be sufficient to keep the source identity consistent in the separated audio. For example, the singer in a separated track may occasionally switch from one to the other. Solving this problem would make it possible to separate the sources in a choir, or in a mixture of multiple instruments with similar timbre, without acquiring large amounts of paired data. In this paper, we examine this problem in the context of duet singing voice separation and propose a method that enforces the coherence of singer identity by splitting the mixture into overlapping segments and performing posterior sampling in an auto-regressive manner, conditioning each segment on the previous one. We evaluate the proposed method on the MedleyVox dataset and show that it outperforms the naive posterior sampling baseline. Our source code and the pre-trained model are publicly available at https://github.com/yoyololicon/duet-svs-diffusion.
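The segment-wise, auto-regressive procedure described above can be illustrated with a minimal sketch. It is not the released implementation: the names `separate_autoregressive` and `toy_sampler`, the segment length, and the overlap size are hypothetical, and the sketch assumes that conditioning is done by holding the overlapping region fixed to the previous segment's estimate. The toy sampler merely halves the mixture and stands in for a real diffusion posterior sampler.

```python
import numpy as np


def toy_sampler(segment, context):
    """Hypothetical stand-in for a diffusion posterior sampler.

    Returns an array of shape (2, len(segment)) whose rows sum to the segment.
    When `context` is given, the first context.shape[1] samples are fixed to it.
    """
    est = np.stack([0.5 * segment, 0.5 * segment])
    if context is not None:
        est[:, : context.shape[1]] = context
    return est


def separate_autoregressive(mixture, sampler, seg_len, overlap, n_sources=2):
    """Separate `mixture` segment by segment, conditioning on the previous output."""
    hop = seg_len - overlap
    if overlap < 0 or hop <= 0:
        raise ValueError("need 0 <= overlap < seg_len")
    out = np.zeros((n_sources, len(mixture)))
    start, context = 0, None
    while True:
        end = min(start + seg_len, len(mixture))
        est = sampler(mixture[start:end], context)
        # Samples in the overlap were already written by the previous segment,
        # so only the new part of the estimate is stored.
        k = 0 if context is None else context.shape[1]
        out[:, start + k : end] = est[:, k:]
        if end == len(mixture):
            break
        start += hop
        # The tail of the previous estimate becomes the condition for the next segment.
        context = out[:, start : start + overlap]
    return out


rng = np.random.default_rng(0)
mixture = rng.standard_normal(1000)
separated = separate_autoregressive(mixture, toy_sampler, seg_len=400, overlap=100)
```

In this sketch each new segment starts from the sources already estimated in the overlap, which is the mechanism that would discourage a sampler from swapping singer identities between segments.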