Sequential Recommender Systems (SRS) aim to predict a user's next interaction from their historical behaviors, but continue to face the challenge of data sparsity. With the rapid advancement of Multimodal Large Language Models (MLLMs), leveraging their multimodal understanding capabilities to enrich item semantic representations has emerged as an effective enhancement strategy for SRS. However, existing MLLM-enhanced recommendation methods still suffer from two key limitations. First, they struggle to effectively align multimodal representations, leading to suboptimal utilization of semantic information across modalities. Second, they rely heavily on MLLM-generated content while overlooking the fine-grained semantic cues contained in the original textual data of items. To address these issues, we propose a Dual-view MLLM-based Enhancing framework for multimodal Sequential Recommendation (DMESR). For the misalignment issue, we employ a contrastive learning mechanism to align the cross-modal semantic representations generated by MLLMs. For the loss of fine-grained semantics, we introduce a cross-attention fusion module that integrates the coarse-grained semantic knowledge obtained from MLLMs with the fine-grained semantics of the original text. The resulting representations from both views can be seamlessly integrated into downstream sequential recommendation models. Extensive experiments on three real-world datasets and three popular sequential recommendation architectures demonstrate the effectiveness and generalizability of our proposed approach.
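To make the two mechanisms named above concrete, the following is a minimal NumPy sketch, not the paper's actual implementation: a symmetric InfoNCE-style contrastive loss that aligns the MLLM's text-side and image-side item representations, and a single-head cross-attention step in which fine-grained token embeddings from the original item text attend over the coarse-grained MLLM representations. All shapes, dimensions, and variable names here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(a, b, tau=0.07):
    # Symmetric InfoNCE: L2-normalize both views, score all pairs,
    # and treat matching rows (same item) as positives.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau
    idx = np.arange(len(a))

    def ce(l):
        # Row-wise cross-entropy with the diagonal as the target class.
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    return 0.5 * (ce(logits) + ce(logits.T))

def cross_attention(query, keys, values):
    # Single-head scaled dot-product cross-attention: fine-grained text
    # tokens (queries) attend over coarse-grained MLLM vectors (keys/values).
    scores = query @ keys.T / np.sqrt(keys.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return attn @ values

# Toy batch: 4 items, 16-dim embeddings per view (hypothetical sizes).
text_view = rng.standard_normal((4, 16))   # MLLM text-side representation
image_view = rng.standard_normal((4, 16))  # MLLM image-side representation
loss = info_nce(text_view, image_view)     # alignment objective

fine_tokens = rng.standard_normal((6, 16))  # original-text token embeddings
fused = cross_attention(fine_tokens, text_view, text_view)  # fused output
```

In a full model, `loss` would be added to the recommendation objective as an auxiliary term, and `fused` would feed the downstream sequential backbone (e.g. SASRec-style encoders) in place of plain ID embeddings.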