Cinematic audio source separation is a relatively new subtask of audio source separation that aims to extract the dialogue, music, and effects stems from their mixture. In this work, we develop a model that generalizes the Bandsplit RNN to any complete or overcomplete partition of the frequency axis. Psychoacoustically motivated frequency scales are used to inform the band definitions, which are now defined with redundancy for more reliable feature extraction. We propose a loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility at inference time through detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset, with performance above the ideal ratio mask for the dialogue stem.
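As a rough illustration of two ideas named above, the following sketch builds an overcomplete (overlapping) band partition on a mel scale and evaluates a 1-norm analogue of the signal-to-noise ratio. Everything in it is an assumption made for illustration: the function names, the HTK-style mel formula, the choice of letting each band span two adjacent mel intervals, and the exact form of the loss are not taken from the abstract and may differ from the model described here.

```python
import numpy as np


def hz_to_mel(f):
    # HTK-style mel formula (an assumed choice of psychoacoustic scale).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def overlapping_mel_bands(n_bins, n_bands, sample_rate):
    """Return (lo, hi) bin ranges, hi exclusive, for overlapping bands.

    Band k spans mel edges k .. k+2, so neighbouring bands share bins
    and the partition of the frequency axis is overcomplete.
    """
    nyquist = sample_rate / 2.0
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(nyquist), n_bands + 2))
    edges_bin = edges_hz / nyquist * (n_bins - 1)
    bands = []
    for k in range(n_bands):
        lo = int(np.floor(edges_bin[k]))
        hi = min(int(np.ceil(edges_bin[k + 2])) + 1, n_bins)
        bands.append((lo, hi))
    return bands


def l1_snr_loss(estimate, target, eps=1e-8):
    """Log-ratio (in dB) of the error 1-norm to the target 1-norm.

    Lower is better. This is a hypothetical 1-norm analogue of
    negative SNR, not necessarily the loss proposed in the paper.
    """
    err = np.sum(np.abs(estimate - target))
    ref = np.sum(np.abs(target))
    return 10.0 * np.log10((err + eps) / (ref + eps))


# Hypothetical settings: 1025 frequency bins (2048-point FFT), 64 bands.
bands = overlapping_mel_bands(n_bins=1025, n_bands=64, sample_rate=44100)
coverage = np.zeros(1025, dtype=int)
for lo, hi in bands:
    coverage[lo:hi] += 1
```

In this sketch, `coverage` counts how many bands contain each frequency bin. A complete partition requires every entry to be at least 1, and an overcomplete one has some entries greater than 1, which is the redundancy the band definitions rely on.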