Harnessing Multimodal Large Language Models for Multimodal Sequential Recommendation

Recent advances in Large Language Models (LLMs) have demonstrated significant potential in the field of Recommendation Systems (RSs). Most existing studies convert user behavior logs into textual prompts and apply techniques such as prompt tuning to adapt LLMs to recommendation tasks. Meanwhile, research interest has grown in multimodal recommendation systems, which integrate data from images, text, and other sources through modality fusion. This poses new challenges for the existing LLM-based recommendation paradigm, which relies solely on text-modality information. Moreover, although Multimodal Large Language Models (MLLMs) capable of processing multimodal inputs have emerged, how to equip MLLMs with multimodal recommendation capabilities remains largely unexplored. To this end, we propose the Multimodal Large Language Model-enhanced Multimodal Sequential Recommendation (MLLM-MSR) model. To capture dynamic user preferences, we design a two-stage user preference summarization method. Specifically, we first use an MLLM-based item summarizer to extract image features for a given item and convert the image into text. We then apply a recurrent user preference summarization paradigm, built on an LLM-based user summarizer, to capture how user preferences change over time. Finally, to enable the MLLM to perform multimodal recommendation, we fine-tune an MLLM-based recommender using Supervised Fine-Tuning (SFT). Extensive evaluations across various datasets validate the effectiveness of MLLM-MSR, showcasing its superior ability to capture and adapt to the evolving dynamics of user preferences.
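The two-stage summarization described above can be illustrated with a minimal Python sketch. It is not the MLLM-MSR implementation: the function names (`summarize_user`, `chunk`), the prompt wording, and the `block_size` parameter are all hypothetical, and the two models are passed in as plain callables so the control flow can run without any real MLLM or LLM. The sketch only shows the shape of the idea: each item image is first turned into text, and a running preference summary is then updated block by block in chronological order.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces: each "model" is just a callable that returns text.
ItemSummarizer = Callable[[str, str], str]  # (image_path, title) -> item description
UserSummarizer = Callable[[str], str]       # prompt -> updated preference summary


def chunk(seq: Sequence[str], size: int) -> List[List[str]]:
    """Split a chronological sequence into consecutive blocks of at most `size`."""
    return [list(seq[i:i + size]) for i in range(0, len(seq), size)]


def summarize_user(
    history: Sequence[Tuple[str, str]],
    item_summarizer: ItemSummarizer,
    user_summarizer: UserSummarizer,
    block_size: int = 3,  # assumed block length, not taken from the paper
) -> str:
    """Return a preference summary for one user's chronological history."""
    # Stage 1: convert every (image, title) pair into a text description.
    item_texts = [item_summarizer(image, title) for image, title in history]

    # Stage 2: recurrently update the summary, feeding the previous
    # summary back in together with the next block of item descriptions.
    summary = ""
    for block in chunk(item_texts, block_size):
        prompt = (
            f"Previous preference summary: {summary or 'none'}\n"
            "Newly interacted items:\n"
            + "\n".join(f"- {text}" for text in block)
            + "\nUpdate the summary of this user's preferences."
        )
        summary = user_summarizer(prompt)
    return summary


if __name__ == "__main__":
    # Stand-in models for demonstration only; real ones would call an MLLM/LLM.
    def fake_item_summarizer(image: str, title: str) -> str:
        return f"{title} (described from {image})"

    def fake_user_summarizer(prompt: str) -> str:
        return f"summary built from a prompt of {len(prompt)} characters"

    demo_history = [(f"img{i}.jpg", f"item {i}") for i in range(5)]
    print(summarize_user(demo_history, fake_item_summarizer, fake_user_summarizer))
```

Under these assumptions, the final summary string would then be placed in the prompt of the fine-tuned recommender alongside the candidate item. Passing the models as callables keeps the recurrence itself easy to test in isolation.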
