MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation

With the growing volume of multimedia information, multimodal recommendation has received extensive attention. It uses multimodal information to alleviate the data sparsity problem in recommender systems and thereby improves recommendation accuracy. However, the reliance on labeled data severely limits the performance of multimodal recommendation models. Recently, self-supervised learning has been used in multimodal recommendation to mitigate the label sparsity problem. Nevertheless, state-of-the-art methods cannot avoid modality noise when aligning multimodal information, because the distributions of different modalities differ widely. To this end, we propose Multi-level sElf-supervised learNing for mulTimOdal Recommendation (MENTOR), a method that addresses both the label sparsity problem and the modality alignment problem. Specifically, MENTOR first enhances the modality-specific features using a graph convolutional network (GCN) and fuses the visual and textual modalities. It then enhances the item representation via the item semantic graph for all modalities, including the fused modality. Next, it introduces two multi-level self-supervised tasks: the multi-level cross-modal alignment task and the general feature enhancement task. The multi-level cross-modal alignment task aligns each modality under the guidance of the ID embedding at multiple levels while preserving the historical interaction information. The general feature enhancement task enhances the general features from both the graph and feature perspectives to improve the robustness of the model. Extensive experiments on three publicly available datasets demonstrate the effectiveness of our method. Our code is publicly available at https://github.com/Jinfeng-Xu/MENTOR.
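
To make the first two stages concrete, the following is a minimal sketch, not the released MENTOR implementation. It assumes details the abstract does not specify: LightGCN-style propagation without learned weights over the user-item graph, a top-k cosine-similarity graph as the item semantic graph, and a plain average as a placeholder for the visual-textual fusion. All function names, sizes, and hyperparameters here are hypothetical.

```python
import numpy as np


def normalized_bipartite(interactions):
    """Symmetrically normalized adjacency of the user-item bipartite graph."""
    n_users, n_items = interactions.shape
    n = n_users + n_items
    adj = np.zeros((n, n))
    adj[:n_users, n_users:] = interactions
    adj[n_users:, :n_users] = interactions.T
    deg = adj.sum(axis=1)
    inv_sqrt = np.zeros_like(deg)
    inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5  # isolated nodes keep weight 0
    return inv_sqrt[:, None] * adj * inv_sqrt[None, :]


def knn_item_graph(features, k):
    """Row-normalized top-k cosine-similarity graph over items (assumed form)."""
    unit = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity from the top-k
    top = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, top] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)


def propagate(adj, emb, layers=2):
    """Average the embeddings obtained after 0..layers propagation steps."""
    out, h = emb.copy(), emb
    for _ in range(layers):
        h = adj @ h
        out = out + h
    return out / (layers + 1)


# Toy data: random interactions and random modality features (hypothetical).
rng = np.random.default_rng(0)
n_users, n_items, dim = 4, 5, 8
interactions = (rng.random((n_users, n_items)) < 0.5).astype(float)
visual = rng.normal(size=(n_items, dim))
textual = rng.normal(size=(n_items, dim))
fused = 0.5 * (visual + textual)  # placeholder fusion, an assumption
user_init = rng.normal(size=(n_users, dim))

ui_adj = normalized_bipartite(interactions)
item_repr = {}
for name, feats in {"visual": visual, "textual": textual, "fused": fused}.items():
    # Stage 1: enhance modality-specific features on the user-item graph.
    h = propagate(ui_adj, np.vstack([user_init, feats]))
    # Stage 2: enhance item representations on the item semantic graph.
    item_repr[name] = propagate(knn_item_graph(feats, k=2), h[n_users:], layers=1)
```

Each entry of `item_repr` holds one enhanced item embedding matrix per modality, including the fused one. The two self-supervised tasks would then operate on these representations; their loss functions are not given in the abstract and are therefore not sketched here.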
