Jointist: Simultaneous Improvement of Multi-instrument Transcription and Music Source Separation via Joint Training

In this paper, we introduce Jointist, an instrument-aware multi-instrument framework capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip. Jointist consists of an instrument recognition module that conditions the other two modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results. Joint training of the transcription and source separation modules improves the performance of both tasks. The instrument module is optional and can be directly controlled by human users, which makes Jointist a flexible, user-controllable framework. Because modern popular music typically consists of multiple instruments, our challenging problem formulation makes the model highly useful in real-world settings. Its novelty, however, necessitates a new perspective on how to evaluate such a model. In our experiments, we assess the proposed model from various aspects, providing a new evaluation perspective for multi-instrument transcription. Our subjective listening study shows that Jointist achieves state-of-the-art performance on popular music, outperforming existing multi-instrument transcription models such as MT3. We also argue that transcription models can be used as a preprocessing module for other music analysis tasks. We conducted experiments on several downstream tasks and found that the proposed method improved transcription by more than 1 percentage point (ppt.), source separation by 5 SDR, downbeat detection by 1.8 ppt., chord recognition by 1.4 ppt., and key estimation by 1.4 ppt., when utilizing transcription results obtained from Jointist.
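The data flow described in the abstract can be illustrated with a minimal Python sketch. It covers only the inference-time wiring of the three modules, not the joint training, and it is not the authors' implementation: the names `run_pipeline`, `JointistOutput`, and the `toy_*` stand-ins are hypothetical, and the three modules are passed in as plain callables rather than neural networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence

Audio = List[float]          # placeholder type: mono waveform samples
PianoRoll = List[List[int]]  # placeholder type: frames x pitches, binary activity


@dataclass
class JointistOutput:
    """Hypothetical container for the three kinds of output."""
    instruments: List[str]
    piano_rolls: Dict[str, PianoRoll]
    sources: Dict[str, Audio]


def run_pipeline(
    audio: Audio,
    recognize: Callable[[Audio], List[str]],
    transcribe: Callable[[Audio, str], PianoRoll],
    separate: Callable[[Audio, str, PianoRoll], Audio],
    user_instruments: Optional[Sequence[str]] = None,
) -> JointistOutput:
    # The recognition module is optional: a user-supplied list bypasses it.
    if user_instruments is not None:
        instruments = list(user_instruments)
    else:
        instruments = recognize(audio)

    piano_rolls: Dict[str, PianoRoll] = {}
    sources: Dict[str, Audio] = {}
    for name in instruments:
        # Transcription is conditioned on the instrument label.
        roll = transcribe(audio, name)
        piano_rolls[name] = roll
        # Separation uses both the instrument label and the piano roll.
        sources[name] = separate(audio, name, roll)
    return JointistOutput(instruments, piano_rolls, sources)


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_recognize(audio: Audio) -> List[str]:
    return ["piano", "bass"]


def toy_transcribe(audio: Audio, instrument: str) -> PianoRoll:
    return [[0] * 88 for _ in range(4)]  # 4 silent frames, 88 pitches


def toy_separate(audio: Audio, instrument: str, roll: PianoRoll) -> Audio:
    return [0.0 for _ in audio]  # silent source of the same length


if __name__ == "__main__":
    clip = [0.1, -0.2, 0.3]
    auto = run_pipeline(clip, toy_recognize, toy_transcribe, toy_separate)
    print(auto.instruments)  # ['piano', 'bass']
    manual = run_pipeline(clip, toy_recognize, toy_transcribe, toy_separate,
                          user_instruments=["drums"])
    print(manual.instruments)  # ['drums']
```

Passing `user_instruments` corresponds to the user-controlled mode mentioned above: the recognition step is skipped, and the transcription and separation modules are conditioned on the requested instruments instead.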
