Multi-View Graph Convolutional Network for Multimedia Recommendation

Multimedia recommendation has received much attention in recent years. It models user preferences based on both behavior information and item multimodal information. Although current GCN-based methods achieve notable success, they suffer from two limitations: (1) Modality noise contaminates the item representations. Existing methods often mix modality features and behavior features in a single view (e.g., the user-item view) for propagation, so the noise in the modality features may be amplified and coupled with the behavior features, which ultimately leads to poor feature discriminability. (2) User preference modeling is incomplete because all modality features are treated equally. Users often exhibit distinct modality preferences when purchasing different items. Fusing the modality features with equal weight ignores the relative importance of the different modalities, leading to suboptimal user preference modeling. To tackle these issues, we propose a novel Multi-View Graph Convolutional Network for multimedia recommendation. Specifically, to avoid modality noise contamination, the modality features are first purified with the aid of item behavior information. Then, the purified modality features of items and the behavior features are enriched in separate views, including the user-item view and the item-item view. In this way, the distinguishability of the features is enhanced. Meanwhile, a behavior-aware fuser is designed to comprehensively model user preferences by adaptively learning the relative importance of different modality features. Furthermore, we equip the fuser with a self-supervised auxiliary task. This task is expected to maximize the mutual information between the fused multimodal features and the behavior features, so as to capture complementary and supplementary preference information simultaneously. Extensive experiments on three public datasets demonstrate the effectiveness of our method.
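
To make the three components named in the abstract more concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify any formulas, so every modeling choice here is an assumption: the sigmoid gate used for purification, the softmax attention used in the fuser, and the InfoNCE-style contrastive loss (a commonly used lower bound on mutual information) used for the auxiliary task. The function names `purify`, `behavior_aware_fuse`, and `info_nce`, the weight matrices, and the `temperature` value are all hypothetical. Graph propagation in the user-item and item-item views is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def purify(modality_feat, behavior_emb, w_proj, w_gate):
    """Assumed purification: a behavior-driven gate scales each projected
    modality dimension by a value in (0, 1)."""
    projected = modality_feat @ w_proj      # (n_items, d)
    gate = sigmoid(behavior_emb @ w_gate)   # (n_items, d)
    return projected * gate


def behavior_aware_fuse(behavior_emb, modality_embs, w_query):
    """Assumed fuser: behavior features act as the query that scores each
    modality; softmax turns the scores into per-row modality weights.
    modality_embs has shape (n_modalities, n, d)."""
    query = behavior_emb @ w_query          # (n, d)
    d = query.shape[1]
    scores = np.einsum("nd,mnd->nm", query, modality_embs) / np.sqrt(d)
    weights = softmax(scores, axis=1)       # (n, n_modalities)
    fused = np.einsum("nm,mnd->nd", weights, modality_embs)
    return fused, weights


def info_nce(fused, behavior, temperature=0.2, eps=1e-12):
    """InfoNCE-style loss: row i of `fused` should match row i of
    `behavior` and mismatch every other row in the batch."""
    f = fused / (np.linalg.norm(fused, axis=1, keepdims=True) + eps)
    b = behavior / (np.linalg.norm(behavior, axis=1, keepdims=True) + eps)
    logits = (f @ b.T) / temperature        # (n, n) cosine similarities
    log_prob = np.log(softmax(logits, axis=1) + eps)
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_items, d = 6, 8
    behavior = rng.normal(size=(n_items, d))   # stand-in for ID embeddings
    raw = {"visual": rng.normal(size=(n_items, 16)),
           "textual": rng.normal(size=(n_items, 12))}

    # Random weights stand in for parameters that would be learned.
    purified = np.stack([
        purify(x, behavior,
               rng.normal(size=(x.shape[1], d)), rng.normal(size=(d, d)))
        for x in raw.values()
    ])                                         # (2, n_items, d)

    fused, weights = behavior_aware_fuse(behavior, purified,
                                         rng.normal(size=(d, d)))
    print("modality weights per item:\n", weights.round(3))
    print("auxiliary loss:", info_nce(fused, behavior))
```

In this sketch each row of `weights` sums to one, so an item whose behavior embedding aligns more closely with its visual features receives a larger visual weight, in contrast to the equal-weight fusion the abstract criticizes. In a trained model the auxiliary loss would be added to the main recommendation loss with a weighting coefficient, which is likewise an assumption here.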
