Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.
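The scaling argument above can be made concrete with a small parameter-count sketch. It compares a design with one dedicated decoder per stem against a single decoder conditioned on a query embedding. All sizes below are hypothetical placeholders, not the actual figures for Banquet or Hybrid Transformer Demucs. The conditioning layer is assumed to be a per-channel scale and shift predicted from the query, which the abstract does not specify.

```python
def fixed_decoder_params(n_stems, encoder, decoder):
    """Parameter count when every supported stem gets its own decoder."""
    return encoder + n_stems * decoder


def query_decoder_params(n_stems, encoder, decoder, query_dim, feature_dim):
    """Parameter count with one shared decoder steered by a query embedding.

    n_stems is accepted only to mirror the signature above; it does not
    affect the result, which is the point of the comparison.
    """
    # Assumed conditioning: two linear maps (scale and shift), each with bias,
    # from the query embedding to the feature channels.
    conditioning = 2 * (query_dim * feature_dim + feature_dim)
    return encoder + decoder + conditioning


# Hypothetical sizes chosen only for illustration.
ENCODER, DECODER = 10_000_000, 3_000_000
QUERY_DIM, FEATURE_DIM = 768, 128

for n in (4, 6, 20, 50):
    fixed = fixed_decoder_params(n, ENCODER, DECODER)
    query = query_decoder_params(n, ENCODER, DECODER, QUERY_DIM, FEATURE_DIM)
    print(f"{n:>2} stems: per-stem decoders = {fixed:,}  query-based = {query:,}")
```

Under these assumptions the per-stem design grows linearly with the number of stems, while the query-based count stays constant. This is why adding long-tail instruments is cheap in the latter case.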