We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system recommends music in response to diverse user queries while generating contextually relevant responses. While pretrained LLMs are designed primarily for the text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY departs from conventional systems in three ways: (1) it unifies the two stages of previous conversational recommendation systems (recommendation engines and dialogue managers) into a cohesive end-to-end system, (2) it uses long conversational context for recommendation while maintaining strong performance in extended multi-turn interactions, and (3) it generates natural language responses for seamless user interaction. Our qualitative and quantitative evaluations demonstrate that TALKPLAY significantly outperforms unimodal approaches based solely on text or listening history in both recommendation performance and conversational naturalness.
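As a rough illustration of the vocabulary expansion idea described in the abstract, the following toy sketch extends a small text vocabulary with one token per (modality, code) pair, so that a track can sit in the same ID sequence as ordinary words. The token format `<|modality_code|>`, the number of codes per modality, the helper names, and the tiny starting vocabulary are all assumptions made for illustration; none of them is taken from the TALKPLAY implementation.

```python
# Toy sketch of vocabulary expansion for music tokens.
# All names, sizes, and the token format are hypothetical, not TALKPLAY's.
from typing import Dict, List

# The five signal types listed in the abstract, in a fixed (assumed) order.
MODALITIES = ["audio", "lyrics", "metadata", "tags", "playlist"]


def expand_vocab(vocab: Dict[str, int], codes_per_modality: int) -> Dict[str, int]:
    """Return a copy of `vocab` with one new token per (modality, code) pair."""
    expanded = dict(vocab)
    for modality in MODALITIES:
        for code in range(codes_per_modality):
            token = f"<|{modality}_{code}|>"
            # New tokens receive the next free ID; existing IDs are untouched.
            expanded.setdefault(token, len(expanded))
    return expanded


def track_to_tokens(track_codes: Dict[str, int]) -> List[str]:
    """Represent one track as a fixed-order sequence of modality tokens."""
    return [f"<|{m}_{track_codes[m]}|>" for m in MODALITIES]


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Map text and music tokens to IDs through the single shared vocabulary."""
    return [vocab[t] for t in tokens]


# A tiny stand-in for a pretrained text vocabulary.
text_vocab = {"<pad>": 0, "recommend": 1, "some": 2, "jazz": 3}
vocab = expand_vocab(text_vocab, codes_per_modality=4)

# Hypothetical per-modality codes for one track (e.g. from clustering features).
track = {"audio": 2, "lyrics": 0, "metadata": 3, "tags": 1, "playlist": 2}

# A user query followed by a recommended track, as one token sequence.
sequence = ["recommend", "some", "jazz"] + track_to_tokens(track)
ids = encode(sequence, vocab)
print(len(vocab), ids)
```

In a real LLM, expanding the vocabulary would also require enlarging the embedding and output layers to cover the new IDs; this sketch shows only the bookkeeping that lets words and music tokens share one ID space.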