Multimodal Music Recommendation System using LLMs

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic and acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. Although some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and alternative item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options to LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B, in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
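To make the reported metrics concrete, the following is a minimal sketch of Recall@K and NDCG@K as they are commonly computed for next-item recommendation with a single held-out target per session, together with the relative-improvement calculation behind a figure such as "up to 95%". The abstract does not state the cutoff K or the exact metric definitions, so the single-target formulation, the function names, and the example values are all assumptions for illustration only.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """Return 1.0 if the held-out target appears in the top-k items, else 0.0.

    Assumes exactly one relevant item per session (an assumption, not
    something the abstract specifies).
    """
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@k for a single relevant item.

    With one relevant item the ideal DCG is 1, so NDCG reduces to
    1 / log2(position + 1), where position is 1-based.
    """
    for index, item in enumerate(ranked[:k]):  # index is 0-based
        if item == target:
            return 1.0 / math.log2(index + 2)
    return 0.0


def relative_improvement(baseline: float, model: float) -> float:
    """Relative gain of `model` over `baseline`, e.g. 0.95 means +95%."""
    return (model - baseline) / baseline


# Hypothetical ranking and scores, not results from the paper.
ranking = ["song_a", "song_b", "song_c"]
print(recall_at_k(ranking, "song_b", k=2))
print(ndcg_at_k(ranking, "song_b", k=2))
print(relative_improvement(baseline=0.10, model=0.195))
```

In practice these per-session values would be averaged over all test sessions before the relative improvement over the ID-only baseline is computed.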
