Recent research on self-supervised contrastive learning of music representations has shown strong results across diverse downstream tasks. However, most existing methods represent equally sized music clips as waveforms or spectrograms and overlook the intrinsic part-whole hierarchies within music. To model the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that enables feature interactions among cropped music clips while accounting for their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer that captures the structural relationships between music clips in a part-whole hierarchy. In addition, we design a hierarchical contrastive learning objective that aligns part-whole music representations at adjacent levels, progressively building a multi-hierarchy representation space. We empirically validate the approach on multiple downstream tasks, including music classification and cover song identification.
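To make the idea of aligning representations at adjacent levels concrete, the following is a minimal sketch, not MART's actual implementation. It assumes a binary hierarchy in which each whole is formed from two neighbouring parts, uses simple averaging (the hypothetical `merge_adjacent`) as a stand-in for the part-whole transformer, and uses an InfoNCE-style loss in which each part's positive is its own parent and the other wholes at that level are negatives. All function names, the temperature value, and the clip count are illustrative assumptions.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def merge_adjacent(parts):
    """Assumed stand-in for the part-whole transformer:
    average each pair of neighbouring parts into one whole."""
    if len(parts) % 2:
        raise ValueError("expected an even number of parts")
    return [
        [(a + b) / 2 for a, b in zip(parts[i], parts[i + 1])]
        for i in range(0, len(parts), 2)
    ]


def part_whole_info_nce(parts, wholes, temperature=0.1):
    """InfoNCE-style loss: the positive for part i is its parent whole i // 2;
    every other whole at that level acts as a negative."""
    parts = [normalize(v) for v in parts]
    wholes = [normalize(v) for v in wholes]
    total = 0.0
    for i, part in enumerate(parts):
        logits = [dot(part, whole) / temperature for whole in wholes]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i // 2]
    return total / len(parts)


def hierarchical_loss(clips, temperature=0.1):
    """Sum the adjacent-level losses while merging bottom-up to a single whole."""
    level, total = clips, 0.0
    while len(level) > 1:
        parents = merge_adjacent(level)
        total += part_whole_info_nce(level, parents, temperature)
        level = parents
    return total


random.seed(0)
# Eight hypothetical clip embeddings of dimension 16 (a power of two keeps the tree binary).
clips = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
loss = hierarchical_loss(clips)
```

In this toy setup the top level contributes zero loss, because a single whole leaves no negatives to contrast against. A practical training setup would presumably draw negatives from other tracks in the batch, which this sketch omits.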