Cinematic audio source separation is a relatively new subtask of audio source separation that aims to extract the dialogue, music, and effects stems from their mixture. In this work, we develop a model that generalizes the Bandsplit RNN to any complete or overcomplete partition of the frequency axis. Psychoacoustically motivated frequency scales inform the band definitions, which are now defined with redundancy for more reliable feature extraction. We propose a loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility at inference time through easily detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset, with performance above the ideal ratio mask for the dialogue stem.
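To make two of the ideas above concrete, the following is a minimal sketch rather than the implementation used in this work. It assumes the HTK-style mel formula as the psychoacoustic scale, builds an overcomplete band partition in which each band spans two adjacent mel intervals (so interior frequency bins fall into two bands), and writes one plausible form of a 1-norm, SNR-style loss in decibels. The function names `overlapping_mel_bands` and `l1_snr_loss_db`, the amount of overlap, and the exact loss expression are all illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def hz_to_mel(f_hz):
    # HTK-style mel scale (assumed here as the psychoacoustic scale).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)


def overlapping_mel_bands(n_bands, n_bins, sample_rate):
    """Boolean (n_bands, n_bins) matrix; row b marks the bins in band b.

    Band b spans mel edges b .. b+2 out of n_bands + 2 equally spaced
    mel points, so neighbouring bands overlap (an assumed design).
    """
    nyquist = sample_rate / 2.0
    bin_hz = np.linspace(0.0, nyquist, n_bins)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2))
    # Pin the outer edges so rounding cannot drop the first or last bin.
    edges_hz[0] = 0.0
    edges_hz[-1] = nyquist
    lo = edges_hz[:-2, None]
    hi = edges_hz[2:, None]
    return (bin_hz[None, :] >= lo) & (bin_hz[None, :] <= hi)


def l1_snr_loss_db(estimate, target, eps=1e-8):
    """Assumed 1-norm analogue of a negative SNR, in dB (lower is better)."""
    err = np.sum(np.abs(estimate - target))
    ref = np.sum(np.abs(target))
    return 10.0 * np.log10((err + eps) / (ref + eps))


# Usage: 16 overlapping bands over 513 STFT bins at 44.1 kHz.
bands = overlapping_mel_bands(n_bands=16, n_bins=513, sample_rate=44100)
target = np.sin(np.linspace(0.0, 10.0, 1000))
loss_silence = l1_snr_loss_db(np.zeros_like(target), target)
loss_perfect = l1_snr_loss_db(target.copy(), target)
```

In this sketch the partition is overcomplete because the total band membership count exceeds the number of bins, while every bin still belongs to at least one band. An all-zero estimate scores about 0 dB under the assumed loss, and the value decreases as the absolute error shrinks.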