In this paper, we focus on multimedia recommender systems based on graph convolutional networks (GCNs), in which multimodal features and user-item interactions are employed together. Our study aims to exploit multimodal features more effectively in order to accurately capture users' preferences for items. To this end, we point out the following two limitations of existing GCN-based multimedia recommender systems: (L1) although the multimodal features of the items a user has interacted with can reveal her preferences, existing methods use GCNs designed to capture only collaborative signals, so the multimodal features are insufficiently reflected in the final user/item embeddings; (L2) although a user decides whether to prefer a target item by considering its multimodal features, existing methods represent her with a single embedding regardless of the target item's multimodal features and then use that embedding to predict her preference for the target item. To address these limitations, we propose a novel multimedia recommender system, named MONET, built on the following two core ideas: modality-embracing GCN (MeGCN) and target-aware attention. Through extensive experiments on four real-world datasets, we demonstrate i) the significant superiority of MONET over seven state-of-the-art competitors (up to 30.32% higher accuracy in terms of recall@20 compared to the best competitor) and ii) the effectiveness of the two core ideas in MONET. The MONET code is available at https://github.com/Kimyungi/MONET.
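To make limitation (L2) concrete, the following is a minimal sketch of what a target-aware user representation could look like. It is not MONET's implementation: the function names, the dot-product scoring, and the softmax weighting are all assumptions chosen for illustration. The point it shows is only that the user vector is recomputed for each target item, instead of being one fixed embedding.

```python
import math

# Illustrative sketch only: all names and the dot-product/softmax
# formulation are assumptions, not taken from the MONET code base.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(interacted_items, target_item):
    """Weight each interacted item by its similarity to the target item."""
    return softmax([dot(item, target_item) for item in interacted_items])

def target_aware_user_embedding(interacted_items, target_item):
    """Build a user vector as an attention-weighted sum of interacted items.

    The result depends on the target item, so the same user is
    represented differently for different candidate items.
    """
    weights = attention_weights(interacted_items, target_item)
    dim = len(target_item)
    return [
        sum(w * item[d] for w, item in zip(weights, interacted_items))
        for d in range(dim)
    ]

def preference_score(interacted_items, target_item):
    """Score a target item with the target-specific user vector."""
    user = target_aware_user_embedding(interacted_items, target_item)
    return dot(user, target_item)

# Toy item embeddings for a user who interacted with two items.
history = [[1.0, 0.0], [0.0, 1.0]]
user_for_a = target_aware_user_embedding(history, [1.0, 0.0])
user_for_b = target_aware_user_embedding(history, [0.0, 1.0])
```

With the toy history, `user_for_a` leans toward the first interacted item and `user_for_b` toward the second, whereas a single fixed user embedding would be identical for both targets.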