In this work, we present a novel method for music emotion recognition that leverages Large Language Model (LLM) embeddings for label alignment across multiple datasets and for zero-shot prediction on novel categories. First, we compute LLM embeddings for the emotion labels of multiple datasets with disjoint label sets and apply non-parametric clustering to group semantically similar labels. We then use the resulting cluster centers to map music features (MERT) into the LLM embedding space. To further strengthen the model, we introduce an alignment regularization that encourages MERT embeddings belonging to different clusters to be separated, which improves adaptation to unseen datasets. We demonstrate the effectiveness of our approach by performing zero-shot inference on a new dataset, showing that it generalizes to unseen labels without additional training.
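As a rough illustration of this pipeline, the sketch below embeds label text with a sentence encoder, groups the label embeddings with a non-parametric clusterer (MeanShift here), and projects MERT features into the label-embedding space so that an unseen label can be selected by cosine similarity. The encoder checkpoint (all-MiniLM-L6-v2), the projection head, and the clustering choice are illustrative assumptions, not the exact configuration used in this work; the alignment regularization would enter as an additional training loss (pulling projected features toward their own cluster center and away from other centers) and is omitted from the sketch.

```python
# Hypothetical sketch of the described pipeline; model choices and
# hyperparameters are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn
from sklearn.cluster import MeanShift
from sentence_transformers import SentenceTransformer

# 1) Embed emotion labels pooled from all datasets with a text encoder.
labels = ["happy", "joyful", "sad", "melancholic", "angry", "tense"]
text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
label_emb = text_encoder.encode(labels, normalize_embeddings=True)   # (L, D)

# 2) Non-parametric clustering groups semantically similar labels.
clustering = MeanShift().fit(label_emb)
centers = torch.tensor(clustering.cluster_centers_, dtype=torch.float32)  # (K, D)

# 3) A small projection head maps MERT audio features into the label space.
class MERTToLabelSpace(nn.Module):
    def __init__(self, mert_dim=768, label_dim=centers.shape[1]):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(mert_dim, 512), nn.ReLU(), nn.Linear(512, label_dim)
        )

    def forward(self, mert_feat):
        # Normalize so cosine similarity reduces to a dot product.
        return nn.functional.normalize(self.proj(mert_feat), dim=-1)

model = MERTToLabelSpace()

# 4) Zero-shot prediction on unseen labels: embed the new label text and
#    pick the label whose embedding is closest to the projected audio feature.
mert_feat = torch.randn(1, 768)                 # placeholder MERT feature
new_labels = ["nostalgic", "euphoric"]          # unseen during training
new_emb = torch.tensor(
    text_encoder.encode(new_labels, normalize_embeddings=True)
)
scores = model(mert_feat) @ new_emb.T           # (1, num_new_labels)
print(new_labels[scores.argmax().item()])
```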