Recent research in self-supervised contrastive learning of music representations has demonstrated remarkable results across diverse downstream tasks. However, most existing methods represent equally sized music clips as either waveforms or spectrograms and often overlook the intrinsic part-whole hierarchies within music. To model the bottom-up structure of music, we introduce MART, a hierarchical music representation learning approach that facilitates feature interactions among cropped music clips while accounting for their part-whole hierarchies. Specifically, we propose a hierarchical part-whole transformer to capture the structural relationships between music clips in a part-whole hierarchy. Furthermore, we design a hierarchical contrastive learning objective that aligns part-whole music representations at adjacent levels, progressively establishing a multi-hierarchy representation space. We empirically validate the effectiveness of music representation learning from part-whole hierarchies on multiple downstream tasks, including music classification and cover song identification.
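To make the idea of aligning part and whole representations at adjacent levels more concrete, the following is a minimal NumPy sketch, not the MART implementation. Everything in it is an assumption made for illustration: the function names (`merge_pairs`, `info_nce`, `hierarchical_loss`), the use of mean pooling over adjacent pairs as a stand-in for the hierarchical part-whole transformer, the binary grouping of clips, the InfoNCE-style loss, and the temperature value. It only shows how a contrastive term could be computed between each level and the one above it, then averaged over levels.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def merge_pairs(parts):
    """Build 'whole' embeddings from adjacent pairs of 'part' embeddings.

    Assumption: mean pooling stands in for a learned part-whole transformer.
    parts: (batch, n, dim) with n even -> returns (batch, n // 2, dim).
    """
    b, n, d = parts.shape
    return parts.reshape(b, n // 2, 2, d).mean(axis=2)


def info_nce(parts, wholes, temperature=0.1):
    """InfoNCE-style loss: each part should match its own parent whole,
    with every other whole in the batch acting as a negative."""
    b, n, d = parts.shape
    p = l2_normalize(parts).reshape(b * n, d)
    w = l2_normalize(wholes).reshape(b * (n // 2), d)
    logits = p @ w.T / temperature              # (b * n, b * n // 2)
    idx = np.arange(b * n)
    targets = idx // 2                          # flattened parent index (n is even)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[idx, targets].mean()


def hierarchical_loss(clips, temperature=0.1):
    """Average the part-whole loss over all adjacent levels.

    clips: (batch, n, dim) clip embeddings, n a power of two (assumption).
    Returns (mean loss, number of adjacent-level pairs).
    """
    losses = []
    parts = clips
    while parts.shape[1] > 1:
        wholes = merge_pairs(parts)
        losses.append(info_nce(parts, wholes, temperature))
        parts = wholes                          # wholes become parts one level up
    return float(np.mean(losses)), len(losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clips = rng.normal(size=(4, 8, 16))         # 4 tracks, 8 clips each, 16-dim
    loss, levels = hierarchical_loss(clips)
    print(f"levels: {levels}, loss: {loss:.4f}")
```

With 8 clips per track, the sketch forms three adjacent-level pairs (8 to 4, 4 to 2, 2 to 1). In a trainable setting the pooling step would be replaced by a learned model and the loss computed in an autodiff framework; the index arithmetic for matching parts to their parents would stay the same.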