Although music is typically multi-label, many works have studied hierarchical music tagging under simplified settings such as single-label data. Moreover, a framework for describing the various joint training methods in the multi-label setting is lacking. To address these topics, we introduce the hierarchical multi-label music instrument classification task. The task provides a realistic setting in which real, multi-instrument music data is assumed. We summarize and explore various hierarchical methods that jointly train a DNN, in the context of fusing deep learning with conventional techniques. For effective joint training in the multi-label setting, we propose two methods to model the connection between fine- and coarse-level tags: one uses rule-based grouped max-pooling, and the other uses an attention mechanism learned in a data-driven manner. Our evaluation shows that the proposed methods have advantages over the method without joint training. In addition, the decision procedure within the proposed methods can be interpreted by visualizing the attention maps or by referring to the fixed rules.
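
To make the two ways of linking fine- and coarse-level tags concrete, the following is a minimal NumPy sketch, not the formulation used in this work. The instrument taxonomy (`FINE_TAGS`, `GROUPS`), the function names, and the softmax-weighted form of the attention pooling are all assumptions made for illustration; the abstract specifies neither the tag set nor the exact attention computation.

```python
import numpy as np

# Hypothetical taxonomy: the abstract does not list the actual instrument tags.
FINE_TAGS = ["violin", "cello", "trumpet", "trombone", "snare", "kick"]
GROUPS = {
    "strings": [0, 1],
    "brass": [2, 3],
    "percussion": [4, 5],
}


def grouped_max_pool(fine_probs, groups):
    """Rule-based link: a coarse tag is as active as its most active fine tag.

    fine_probs: array of shape (..., n_fine) with per-tag probabilities.
    Returns an array of shape (..., n_coarse), ordered like `groups`.
    """
    fine_probs = np.asarray(fine_probs, dtype=float)
    pooled = [fine_probs[..., idx].max(axis=-1) for idx in groups.values()]
    return np.stack(pooled, axis=-1)


def attention_pool(fine_probs, scores):
    """Data-driven link (assumed form): softmax-weighted sum over fine tags.

    scores: array of shape (n_coarse, n_fine); in a real model these logits
    would be learned, here they are passed in as a plain array.
    Returns (coarse_probs, weights); `weights` is the attention map.
    """
    fine_probs = np.asarray(fine_probs, dtype=float)
    scores = np.asarray(scores, dtype=float)
    # Numerically stable softmax over the fine-tag axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return fine_probs @ weights.T, weights


# One clip with multi-label fine-level predictions.
fine = np.array([[0.9, 0.2, 0.1, 0.05, 0.3, 0.7]])

coarse_rule = grouped_max_pool(fine, GROUPS)

# Untrained (all-zero) scores give uniform attention over the fine tags.
coarse_attn, attn_map = attention_pool(fine, np.zeros((len(GROUPS), len(FINE_TAGS))))
```

In this sketch the rule-based variant can be read off directly from `GROUPS`, while the attention variant exposes `attn_map`, a coarse-by-fine weight matrix that can be plotted to see which fine-level tags drive each coarse-level prediction.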