Multimodal recommendation aims to model user and item representations comprehensively by incorporating multimedia content. Existing research has shown that combining user and item ID embeddings with salient multimodal features improves recommendation performance, which indicates the value of IDs. However, the literature lacks a thorough analysis of ID embeddings in terms of feature semantics. In this paper, we revisit the value of ID embeddings for multimodal recommendation and conduct a thorough study of their semantics, which we recognize as subtle features of content and structure. We then propose a novel recommendation model that incorporates ID embeddings to enhance the semantic features of both content and structure. Specifically, we put forward a hierarchical attention mechanism that incorporates ID embeddings into modality fusion, coupled with contrastive learning, to enhance content representations. Meanwhile, we propose a lightweight graph convolutional network for each modality that combines neighborhood and ID embeddings to improve structural representations. Finally, the content and structure representations are combined to form the final item embedding for recommendation. Extensive experiments on three real-world datasets (Baby, Sports, and Clothing) demonstrate the superiority of our method over state-of-the-art multimodal recommendation methods and the effectiveness of fine-grained ID embeddings.
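To make the structural branch more concrete, the following is a minimal sketch of a parameter-free ("lightweight") graph convolution that mixes neighborhood information with ID embeddings for a single modality. It is not the model described in the abstract: the function names (`normalize_adjacency`, `propagate`, `structure_embedding`), the choice to inject ID embeddings by element-wise addition, and the layer-averaging readout are all assumptions made for illustration.

```python
from math import sqrt


def normalize_adjacency(adj):
    """Symmetrically normalize a dense adjacency matrix: D^-1/2 A D^-1/2."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [
        [
            adj[i][j] / sqrt(deg[i] * deg[j]) if adj[i][j] and deg[i] and deg[j] else 0.0
            for j in range(n)
        ]
        for i in range(n)
    ]


def propagate(norm_adj, feats):
    """One propagation step with no weights or nonlinearity: norm_adj @ feats."""
    n, dim = len(feats), len(feats[0])
    return [
        [sum(norm_adj[i][k] * feats[k][d] for k in range(n)) for d in range(dim)]
        for i in range(n)
    ]


def structure_embedding(adj, modal_feats, id_emb, num_layers=2):
    """Propagate (modality feature + ID embedding) and average over all layers."""
    norm_adj = normalize_adjacency(adj)
    # Assumption: ID embeddings are injected by element-wise addition.
    layer = [[m + e for m, e in zip(mf, ie)] for mf, ie in zip(modal_feats, id_emb)]
    total = [row[:] for row in layer]
    for _ in range(num_layers):
        layer = propagate(norm_adj, layer)
        total = [[t + v for t, v in zip(tr, lr)] for tr, lr in zip(total, layer)]
    # Mean over the input layer and every propagated layer.
    return [[t / (num_layers + 1) for t in row] for row in total]
```

Calling `structure_embedding` once per modality and then combining the results with a content representation would mirror the overall flow the abstract outlines, though the actual fusion step is not specified there.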