A range of recent works has shown object-centric architectures to be suitable for unsupervised scene decomposition in the vision domain. Inspired by these methods, we present AudioSlots, a slot-centric generative model for blind source separation in the audio domain. AudioSlots is built from permutation-equivariant encoder and decoder networks. The encoder network, based on the Transformer architecture, learns to map a mixed audio spectrogram to an unordered set of independent source embeddings. The spatial broadcast decoder network learns to generate the source spectrograms from the source embeddings. We train the model end-to-end using a permutation-invariant loss function. Our results on Libri2Mix speech separation serve as a proof of concept that the approach is promising. We discuss the results and limitations of our approach in detail, and outline potential ways to overcome these limitations as well as directions for future work.
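As a minimal illustration of what a permutation-invariant loss computes, the sketch below scores every assignment of predicted sources to target sources and keeps the best one. It is not the AudioSlots implementation: the mean squared error, the exhaustive search over orderings, and the names `mse` and `permutation_invariant_loss` are assumptions made for illustration, and spectrograms are represented as plain nested lists.

```python
from itertools import permutations


def mse(a, b):
    """Mean squared error between two spectrograms given as nested lists."""
    diffs = [(x - y) ** 2
             for row_a, row_b in zip(a, b)
             for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def permutation_invariant_loss(pred, target):
    """Return (loss, perm) for the best ordering of the predicted sources.

    pred and target are lists of spectrograms, one per source. perm[i] is the
    index of the prediction matched to target i. The per-source error (MSE)
    is an assumed choice, not one taken from the paper.
    """
    best_loss, best_perm = float("inf"), None
    # Exhaustive search: fine for two or three sources, factorial in general.
    for perm in permutations(range(len(pred))):
        loss = sum(mse(pred[p], t) for p, t in zip(perm, target)) / len(target)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two toy 2x2 "spectrograms"; the predictions are the targets in swapped order.
target = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.5, 0.5]],
]
pred = [target[1], target[0]]

loss, perm = permutation_invariant_loss(pred, target)
```

Because the loss is minimized over orderings, the swapped predictions are not penalized for appearing in a different order than the targets. For more than a handful of sources, an assignment solver such as the Hungarian algorithm would replace the exhaustive loop.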