Cinematic audio source separation is a relatively new subtask of audio source separation that aims to extract the dialogue, music, and effects stems from their mixture. In this work, we develop a model that generalizes the Bandsplit RNN to any complete or overcomplete partition of the frequency axis. Psycho-acoustically motivated frequency scales inform the band definitions, which are now defined with redundancy for more reliable feature extraction. We propose a loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm. We additionally exploit the information-sharing property of a common-encoder setup to reduce computational complexity during both training and inference, improve separation performance for hard-to-generalize classes of sounds, and allow flexibility at inference time through easily detachable decoders. Our best model sets the state of the art on the Divide and Remaster dataset, with performance above the ideal ratio mask for the dialogue stem.
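To make the idea of an overcomplete partition of the frequency axis concrete, the following is a minimal sketch and not the authors' band definition. It assumes a mel scale (using the common 2595·log10(1 + f/700) formula) as the psychoacoustic scale and lets each band span two neighbouring mel intervals, so that adjacent bands overlap. The function names `overlapping_mel_bands` and `coverage`, as well as the parameter values in the usage note, are hypothetical and chosen only for illustration.

```python
import math


def hz_to_mel(f_hz):
    # Common mel-scale formula; assumed here, not taken from the paper.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def overlapping_mel_bands(n_bins, sample_rate, n_bands):
    """Return n_bands half-open (start_bin, end_bin) ranges over STFT bins.

    Band b spans the mel-spaced points b .. b+2, so consecutive bands
    share roughly half of their bins (an overcomplete partition).
    """
    nyquist = sample_rate / 2.0
    max_mel = hz_to_mel(nyquist)
    # n_bands + 2 points equally spaced on the mel axis, mapped to bin indices.
    points = [
        mel_to_hz(max_mel * i / (n_bands + 1)) / nyquist * (n_bins - 1)
        for i in range(n_bands + 2)
    ]
    bands = []
    for b in range(n_bands):
        start = max(0, int(math.floor(points[b])))
        end = min(n_bins, int(math.ceil(points[b + 2])) + 1)
        bands.append((start, end))
    return bands


def coverage(bands, n_bins):
    """Count how many bands contain each frequency bin."""
    counts = [0] * n_bins
    for start, end in bands:
        for k in range(start, end):
            counts[k] += 1
    return counts
```

For example, `overlapping_mel_bands(1025, 44100, 64)` yields 64 ranges over the 1025 bins of a 2048-point STFT. `coverage` can then confirm that every bin belongs to at least one band and that the band widths sum to more than the number of bins, which is what makes the partition overcomplete rather than merely complete.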