Cinematic audio source separation (CASS) is a relatively new subtask of audio source separation, concerned with separating a mixture into dialogue, music, and effects stems. To date, only one publicly available dataset exists for CASS: the Divide and Remaster (DnR) dataset, currently at version 2. While DnR v2 has been an incredibly useful resource for CASS, several areas for improvement have been identified, particularly through its use in the 2023 Sound Demixing Challenge. In this work, we develop version 3 of the DnR dataset, addressing issues relating to vocal content in non-dialogue stems, loudness distributions, the mastering process, and linguistic diversity. In particular, the dialogue stem of DnR v3 includes speech content from more than 30 languages from multiple families, including but not limited to the Germanic, Romance, Indo-Aryan, Dravidian, Malayo-Polynesian, and Bantu families. Benchmark results using the Bandit model indicate that training on multilingual data substantially improves the model's generalizability, even in languages with low data availability. Even in languages with high data availability, the multilingual model often performs on par with or better than dedicated models trained on monolingual CASS datasets.