Music genre classification has been widely studied in the past few years because of its many applications in music information retrieval. Previous methods tend to perform unsatisfactorily, since they either use audio content alone or combine audio and lyrics content inefficiently. In addition, because several genres normally co-occur in a single music track, it is desirable to capture and model genre correlations to improve multi-label music genre classification. To address these issues, we present a novel multi-modal method that leverages an audio-lyrics contrastive loss and two symmetric cross-modal attention modules to align and fuse features from audio and lyrics. Furthermore, based on the nature of multi-label classification, we present a genre correlation extraction module that captures and models potential genre correlations. Extensive experiments demonstrate that the proposed method significantly surpasses other multi-label music genre classification methods and achieves state-of-the-art results on the Music4All dataset.
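To make the alignment idea concrete, the following is a minimal sketch of one common form of contrastive loss between paired audio and lyrics embeddings. The abstract does not specify the exact formulation, so this assumes a symmetric InfoNCE-style loss over cosine similarities. The function names, the temperature value of 0.07, and the toy embeddings are all illustrative assumptions rather than details of the proposed method.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def audio_lyrics_contrastive_loss(audio_emb, lyrics_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (an assumed form, not the paper's exact loss).

    audio_emb[i] and lyrics_emb[i] are embeddings of the same track;
    all other pairings in the batch are treated as negatives.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    lyrics = [l2_normalize(v) for v in lyrics_emb]
    # Cosine similarity of every audio embedding with every lyrics embedding.
    logits = [
        [sum(a * b for a, b in zip(a_vec, l_vec)) / temperature for l_vec in lyrics]
        for a_vec in audio
    ]
    # Transpose to score lyrics-to-audio retrieval as well.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Toy batch of two tracks with 2-dimensional embeddings (hypothetical values).
aligned = audio_lyrics_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                        [[1.0, 0.0], [0.0, 1.0]])
mismatched = audio_lyrics_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                           [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is near zero when each audio embedding matches its own lyrics embedding and large when the pairs are swapped, which is the behavior that pushes the two modalities into a shared space before fusion.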