In this paper, we propose a latent-variable generative model called mixture of dynamical variational autoencoders (MixDVAE) to model the dynamics of a system composed of multiple moving sources. A DVAE model is first pre-trained on a single-source dataset to capture the source dynamics. Multiple instances of the pre-trained DVAE model are then integrated into a multi-source mixture model with a discrete latent variable that assigns each observation to a source. The posterior distributions of both the discrete observation-to-source assignment variable and the continuous DVAE variables representing the content or position of each source are estimated using a variational expectation-maximization algorithm, which yields estimates of the multi-source trajectories. We illustrate the versatility of the proposed MixDVAE model on two tasks: a computer vision task, namely multi-object tracking, and an audio processing task, namely single-channel audio source separation. Experimental results show that the proposed method performs well on both tasks and outperforms several baseline methods.
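To make the role of the observation-to-source assignment variable more concrete, the following is a minimal sketch of how posterior assignment probabilities can be computed in a generic Gaussian mixture setting. It is not the MixDVAE update itself: the function names, the isotropic Gaussian observation model, and the uniform prior are all assumptions made for illustration, and the per-source means stand in for what a pre-trained DVAE would predict.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log-density of an isotropic Gaussian (assumed observation model) at point x."""
    d = len(x)
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2.0 * math.pi * var) + sq_dist / var)


def assignment_posteriors(observations, source_means, source_vars, priors=None):
    """Posterior probability that each observation was generated by each source.

    Hypothetical helper: source_means and source_vars are placeholders for the
    per-source predictions of a dynamical model at one time frame.
    """
    n_sources = len(source_means)
    if priors is None:
        # Assumption: uniform prior over sources.
        priors = [1.0 / n_sources] * n_sources
    posteriors = []
    for obs in observations:
        log_w = [
            math.log(p) + gaussian_log_pdf(obs, m, v)
            for p, m, v in zip(priors, source_means, source_vars)
        ]
        # Subtract the maximum before exponentiating for numerical stability.
        max_log = max(log_w)
        w = [math.exp(lw - max_log) for lw in log_w]
        total = sum(w)
        posteriors.append([wi / total for wi in w])
    return posteriors


# Two 2-D observations and two sources with well-separated predicted positions.
observations = [(0.1, 0.0), (4.9, 5.2)]
source_means = [(0.0, 0.0), (5.0, 5.0)]
source_vars = [1.0, 1.0]
posteriors = assignment_posteriors(observations, source_means, source_vars)
```

In this toy setting each observation lies close to exactly one source, so its posterior mass concentrates on that source. In a variational expectation-maximization scheme of the kind described above, a step like this would alternate with an update of the continuous per-source variables.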