The rise of multi-modal sharing platforms (e.g., TikTok, YouTube) is driving personalized recommender systems to incorporate various modalities (e.g., visual, textual, and acoustic) into latent user representations. While existing work on multi-modal recommendation exploits multimedia content features to enhance item embeddings, its representation capability is limited by heavy reliance on labels and weak robustness to sparse user behavior data. Inspired by recent progress in self-supervised learning for alleviating the label scarcity issue, we explore deriving self-supervision signals by effectively learning modality-aware user preferences and cross-modal dependencies. To this end, we propose a new Multi-Modal Self-Supervised Learning (MMSSL) method that tackles two key challenges. First, to characterize the inter-dependency between the user-item collaborative view and the item multi-modal semantic view, we design a modality-aware interactive structure learning paradigm that uses adversarial perturbations for data augmentation. Second, to capture how users' modality-aware interaction patterns interweave with each other, we introduce a cross-modal contrastive learning approach that jointly preserves inter-modal semantic commonality and user preference diversity. Experiments on real-world datasets show that our method outperforms various state-of-the-art baselines for multimedia recommendation. The implementation is released at: https://github.com/HKUDS/MMSSL.
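To make the idea of cross-modal contrastive learning more concrete, the following is a minimal sketch of an InfoNCE-style loss, a commonly used contrastive objective. It is not taken from the MMSSL implementation, and the abstract does not state which loss the method uses. The function names, the temperature value, and the assumption that each user has one embedding per modality are all hypothetical and chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(view_a, view_b, temperature=0.2):
    """InfoNCE loss between two aligned lists of embeddings.

    Assumption (hypothetical setup): view_a[i] and view_b[i] are embeddings
    of the same user under two different modalities (the positive pair);
    view_b[j] for j != i act as negatives.
    """
    a = [l2_normalize(v) for v in view_a]
    b = [l2_normalize(v) for v in view_b]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate in view_b.
        logits = [dot(anchor, other) / temperature for other in b]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the positive pair.
        total += log_denom - logits[i]
    return total / len(a)
```

Under these assumptions, passing the same users' visual and textual embeddings in matching order should give a lower loss than passing them in mismatched order, which is the behavior a contrastive objective is meant to encourage.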