Multimodal Pre-training Framework for Sequential Recommendation via Contrastive Learning

Current multimodal sequential recommendation models often fail to capture correlations among the behavior sequences of users and items across different modalities: they either neglect correlations among sequence representations or inadequately capture associations between multimodal data and sequence data in their representations. To address this problem, we explore multimodal pre-training in the context of sequential recommendation, with the aim of enhancing the fusion and utilization of multimodal information. We propose a novel Multimodal Pre-training for Sequential Recommendation (MP4SR) framework, which uses contrastive losses to capture the correlation among different modality sequences of users, as well as the correlation among different modality sequences of users and items. MP4SR consists of three key components: 1) multimodal feature extraction, 2) a backbone network, the Multimodal Mixup Sequence Encoder (M2SE), and 3) pre-training tasks. After pre-trained encoders generate the initial multimodal features of items, M2SE adopts a complementary sequence mixup strategy to fuse different modality sequences and leverages contrastive learning to capture modality interactions at the sequence-to-sequence and sequence-to-item levels. Extensive experiments on four real-world datasets demonstrate that MP4SR outperforms state-of-the-art approaches in both normal and cold-start settings. We further show that incorporating multimodal pre-training benefits representation learning for sequential recommendation: it serves as an effective regularizer and optimizes the parameter space for the recommendation task.
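The abstract does not spell out how the complementary sequence mixup or the contrastive losses are computed, so the following is only a minimal sketch of how the two ideas might fit together. It assumes NumPy, text and image item features already projected to a common dimension, a random per-position swap mask, mean pooling as a stand-in for M2SE, and an in-batch InfoNCE loss. The function names (`complementary_mixup`, `info_nce`, `seq_to_seq_loss`) are hypothetical and are not taken from MP4SR.

```python
import numpy as np


def complementary_mixup(text_seq, image_seq, p=0.5, rng=None):
    """Build two complementary mixed-modality views of one user sequence.

    text_seq, image_seq: arrays of shape (T, d), one row per interacted item.
    Assumption: each position is swapped between modalities with probability p,
    so whatever modality one view uses at step t, the other view uses the
    opposite modality at that step.
    """
    rng = np.random.default_rng() if rng is None else rng
    swap = (rng.random(text_seq.shape[0]) < p)[:, None]  # (T, 1) boolean mask
    mixed_a = np.where(swap, image_seq, text_seq)
    mixed_b = np.where(swap, text_seq, image_seq)
    return mixed_a, mixed_b


def l2_normalize(x, eps=1e-12):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(anchors, positives, temperature=0.1):
    """In-batch InfoNCE: row i of `positives` is the positive for row i of
    `anchors`; every other row in the batch acts as a negative."""
    logits = l2_normalize(anchors) @ l2_normalize(positives).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def seq_to_seq_loss(text_batch, image_batch, p=0.5, temperature=0.1, rng=None):
    """Sequence-to-sequence contrastive loss over a batch of users.

    text_batch, image_batch: arrays of shape (B, T, d).
    Assumption: mean pooling replaces the real sequence encoder.
    """
    rng = np.random.default_rng() if rng is None else rng
    views_a, views_b = [], []
    for text_seq, image_seq in zip(text_batch, image_batch):
        mixed_a, mixed_b = complementary_mixup(text_seq, image_seq, p, rng)
        views_a.append(mixed_a.mean(axis=0))
        views_b.append(mixed_b.mean(axis=0))
    return info_nce(np.stack(views_a), np.stack(views_b), temperature)
```

Under the same assumptions, `info_nce` could also serve a sequence-to-item objective by passing pooled sequence representations as anchors and the features of the matching items as positives. Which pairs MP4SR actually contrasts is not stated in the abstract.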