Abstract: Cover song identification (CSI) aims to find different versions of the same music among reference anchors, given a query track. In this paper, we propose CoverHunter, a novel system that overcomes the shortcomings of existing detection schemes by exploring richer features with refined attention and alignment. CoverHunter contains three key modules: 1) a convolution-augmented transformer (i.e., Conformer) structure that captures both local and global feature interactions, in contrast to previous methods that rely mainly on convolutional neural networks; 2) an attention-based time pooling module that further exploits attention along the time dimension; 3) a novel coarse-to-fine training scheme that first trains a network to roughly align the song chunks and then refines the network by training on the aligned chunks. We also summarize several important training tricks used in our system that help achieve better results. Experiments on several standard CSI datasets show that, with an embedding size of 128, our method significantly improves over state-of-the-art methods (by 2.3% on SHS100K-TEST and 17.7% on DaTacos).
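As a rough illustration of what attention-based pooling over the time dimension can look like, the following is a minimal sketch in plain Python. It is not the CoverHunter implementation: the function name `attention_time_pooling`, the single linear scorer `score_weights`, and the use of a plain softmax are all assumptions made for illustration, since the abstract does not specify the module's internals.

```python
import math
from typing import List, Sequence


def attention_time_pooling(frames: Sequence[Sequence[float]],
                           score_weights: Sequence[float]) -> List[float]:
    """Collapse a (time, dim) feature sequence into one dim-sized vector.

    Hypothetical sketch: each frame gets a scalar score from a linear
    scorer, scores are normalized with a softmax over time, and the
    frames are averaged using those weights.
    """
    if not frames:
        raise ValueError("need at least one frame")

    # One scalar relevance score per time step (assumed linear scorer).
    scores = [sum(x * w for x, w in zip(frame, score_weights))
              for frame in frames]

    # Numerically stable softmax over the time axis.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Attention-weighted average of the frames.
    dim = len(frames[0])
    return [sum(w * frame[d] for w, frame in zip(weights, frames))
            for d in range(dim)]
```

With all-zero `score_weights` every frame receives the same weight, so the function reduces to ordinary mean pooling; in a trained model the scorer would be learned so that more informative time steps contribute more to the final embedding.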