Cinematic audio source separation (CASS) is a relatively new subtask of audio source separation. A typical CASS setup is a three-stem problem, in which the mixture is separated into a dialogue stem (DX), a music stem (MX), and an effects stem (FX). In practice, however, several edge cases exist, as some sound sources do not fit neatly into any of these three stems, necessitating additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong to either DX or MX, depending heavily on the cinematic context. In this work, we demonstrate a straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesize that this is due to better feature alignment at the bottleneck, as enforced by the band-agnostic FiLM layer. The dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.
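To make the idea of a band-agnostic FiLM layer concrete, the following is a minimal sketch of feature-wise linear modulation driven by a stem query. It is an illustration under stated assumptions, not the Banquet implementation: the one-hot query, the toy weight matrices, the two-channel features, and all function names are hypothetical. The point it shows is that a single scale and shift pair, derived from the query, is applied identically to every frequency band at the bottleneck.

```python
# Illustrative sketch only. Stem names, one-hot queries, weights, and shapes
# are assumptions made for this example, not taken from Bandit or Banquet.

STEMS = ["dialogue", "music_instrumental", "singing_voice", "effects"]

# Hypothetical projections from a 4-dim query to 2 channel-wise scales/shifts.
W_GAMMA = [[1.0, 0.5, 2.0, 0.0], [1.0, 1.5, 0.5, 2.0]]
B_GAMMA = [0.0, 0.0]
W_BETA = [[0.0, 0.1, -1.0, 0.2], [0.0, -0.1, 1.0, 0.3]]
B_BETA = [0.0, 0.0]


def one_hot_query(stem):
    """Encode the requested stem as a one-hot vector (an assumed query form)."""
    return [1.0 if s == stem else 0.0 for s in STEMS]


def linear(weights, bias, vec):
    """Plain affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def film(features, gamma, beta):
    """Apply y = gamma * x + beta per channel.

    `features` is indexed as [band][time][channel]. The same gamma and beta
    are reused for every band and frame, which is what makes the modulation
    band-agnostic in this sketch.
    """
    return [[[g * x + b for x, g, b in zip(frame, gamma, beta)]
             for frame in band]
            for band in features]


def condition(features, stem):
    """Modulate shared bottleneck features toward the queried stem."""
    query = one_hot_query(stem)
    gamma = linear(W_GAMMA, B_GAMMA, query)
    beta = linear(W_BETA, B_BETA, query)
    return film(features, gamma, beta)


# Two bands, one time frame, two channels.
features = [[[1.0, 2.0]], [[3.0, 4.0]]]
singing = condition(features, "singing_voice")
```

In this toy setup a single decoder would then consume `singing`, whereas a dedicated-decoder design would instead route the unmodulated features to one decoder per stem.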