Sequential Recommendation (SR) aims to predict future user-item interactions based on historical interactions. While many SR approaches concentrate on user IDs and item IDs, the way humans perceive the world through multi-modal signals, such as text and images, has inspired researchers to construct SR from multi-modal information without using IDs. However, the complexity of multi-modal learning manifests in diverse feature extractors, fusion methods, and pre-trained models. Consequently, designing a simple and universal \textbf{M}ulti-\textbf{M}odal \textbf{S}equential \textbf{R}ecommendation (\textbf{MMSR}) framework remains a formidable challenge. We systematically summarize existing multi-modal SR methods and distill their essence into four core components: visual encoder, text encoder, multi-modal fusion module, and sequential architecture. Along these dimensions, we dissect model designs and answer the following sub-questions: First, we explore how to construct MMSR from scratch, ensuring that its performance matches or exceeds that of existing SR methods without complex techniques. Second, we examine whether MMSR can benefit from existing multi-modal pre-training paradigms. Third, we assess MMSR's capability in tackling common challenges such as cold start and domain transfer. Our experimental results across four real-world recommendation scenarios demonstrate the great potential of ID-agnostic multi-modal sequential recommendation. Our framework can be found at: https://github.com/MMSR23/MMSR.