Music genre classification has been widely studied in the past few years because of its many applications in music information retrieval. Previous methods tend to perform unsatisfactorily because they either use audio content alone or combine audio and lyrics content inefficiently. In addition, since several genres normally co-occur in a single music track, it is desirable to capture and model genre correlations to improve the performance of multi-label music genre classification. To address these issues, we present a novel multi-modal method that leverages an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, we present a genre correlation extraction module to capture and model potential genre correlations. Extensive experiments demonstrate that our proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset.
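To make the alignment step concrete, the following is a minimal sketch of an audio-lyrics contrastive loss. The abstract does not give the exact formulation, so this assumes a symmetric InfoNCE-style objective over a batch of paired embeddings; the function names, the `temperature` value, and the embedding shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    audio_emb, lyrics_emb: arrays of shape (batch, dim); row i of each
    array is assumed to come from the same track.
    temperature: hypothetical scaling hyperparameter.
    """
    # L2-normalize so the dot product is a cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = lyrics_emb / np.linalg.norm(lyrics_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares audio i with lyrics j.
    logits = a @ t.T / temperature
    idx = np.arange(len(a))

    # Matching pairs sit on the diagonal; pull them together in both directions.
    loss_audio_to_lyrics = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_lyrics_to_audio = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_audio_to_lyrics + loss_lyrics_to_audio)
```

Under these assumptions, the loss is small when each track's audio embedding is closest to its own lyrics embedding and grows when pairs are mismatched, which is the behavior an alignment objective of this kind is meant to encourage before the two modalities are fused.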