Variational autoencoders (VAEs) are among the leading approaches to learning disentangled representations. Typically, a single VAE is used and disentangled representations are sought within its single continuous latent space. In this paper, we propose and provide a proof of concept for a novel Multi-Stream Variational Autoencoder (MS-VAE) that achieves disentanglement of sources by combining discrete and continuous latents. The discrete latents are used in an explicit source combination model that superimposes a set of sources as part of the MS-VAE decoder. We formally define the MS-VAE approach, derive its inference and learning equations, and numerically investigate how it functions in principle. The MS-VAE model is very flexible and can be trained with little supervision (we use fully unsupervised learning after pretraining with some labels). In our numerical experiments, we explored the ability of the MS-VAE approach to separate both superimposed handwritten digits and sound sources. For the former task, we used superimposed MNIST digits (an increasingly common benchmark). For sound separation, our experiments focused on the task of speaker diarization in a recorded conversation between two speakers. In all cases, we observe a clear separation of sources and competitive performance after training. For digit superpositions, performance is particularly competitive in complex mixtures (e.g., three and four digits). For the speaker diarization task, we observe an especially low rate of missed speakers and more precise speaker attribution. Numerical experiments confirm the flexibility of the approach across varying amounts of supervision, and we observed high performance, e.g., when using just 10% of the labels for pretraining.
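To make the role of the discrete latents concrete, the following is a minimal sketch of a source combination step, not the MS-VAE decoder itself. It assumes that each source has its own decoder, that a binary latent per source indicates whether that source is present, and that superposition is a pixel-wise sum clipped to [0, 1]; the abstract does not specify the combination rule, so this choice is purely illustrative. The names `make_decoder` and `combine_sources`, the random linear decoders, and all dimensions are hypothetical.

```python
import math
import random


def make_decoder(latent_dim, out_dim, rng):
    """Hypothetical per-source decoder: fixed random linear map plus sigmoid."""
    weights = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
               for _ in range(out_dim)]

    def decode(z):
        # Map a continuous latent vector z to out_dim values in (0, 1).
        return [1.0 / (1.0 + math.exp(-sum(w * zi for w, zi in zip(row, z))))
                for row in weights]

    return decode


def combine_sources(active, latents, decoders):
    """Superimpose the decoded sources whose discrete latent is 1.

    Assumption (not from the paper): superposition is a pixel-wise sum
    clipped to [0, 1].
    """
    out_dim = len(decoders[0](latents[0]))
    mixture = [0.0] * out_dim
    for s_k, z_k, decode_k in zip(active, latents, decoders):
        if s_k:  # discrete latent: is source k present in the mixture?
            source = decode_k(z_k)
            mixture = [min(1.0, m + x) for m, x in zip(mixture, source)]
    return mixture


rng = random.Random(0)
num_sources, latent_dim, out_dim = 3, 2, 4
decoders = [make_decoder(latent_dim, out_dim, rng) for _ in range(num_sources)]
latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
           for _ in range(num_sources)]
active = [1, 0, 1]  # sources 0 and 2 are present, source 1 is absent
mixture = combine_sources(active, latents, decoders)
```

In this sketch, the continuous latents control what each source looks like, while the discrete latents control which sources contribute to the mixture. Inference would have to recover both from the observed mixture alone.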