The core idea of multimodal recommendation is to make effective use of items' multimodal information to improve recommendation performance. Previous works directly integrate item multimodal features with item ID embeddings, ignoring the inherent semantic relations contained in the multimodal features. In this paper, we propose a novel and effective aTtention-guided Multi-step FUsion Network for multimodal recommendation, named TMFUN. Specifically, our model first constructs a modality feature graph and an item feature graph to model the latent item-item semantic structures. We then use an attention module to identify inherent connections between user-item interaction data and multimodal data, evaluate the impact of multimodal data on different interactions, and achieve early-step fusion of item features. Furthermore, our model optimizes item representations through an attention-guided multi-step fusion strategy and contrastive learning to improve recommendation performance. Extensive experiments on three real-world datasets show that our model outperforms state-of-the-art models.
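To make the first two steps concrete, the following is a minimal sketch of building an item-item graph from modality features and fusing the modalities with attention. It is not the TMFUN implementation: the abstract does not specify how the graphs are built or how attention is computed, so the cosine-similarity kNN graph, the scaled dot-product attention against the ID embedding, the function names `knn_item_graph` and `attention_fuse`, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np


def knn_item_graph(features: np.ndarray, k: int) -> np.ndarray:
    """Row-normalized kNN adjacency built from cosine similarity (assumed construction)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # keep an item from selecting itself as a neighbor
    top_k = np.argsort(-sim, axis=1)[:, :k]  # indices of the k most similar items
    adj = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    adj[rows, top_k] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)


def attention_fuse(id_emb: np.ndarray, modal_embs: list) -> tuple:
    """Weight each modality per item by softmax attention against the ID embedding."""
    stacked = np.stack(modal_embs, axis=1)  # (n_items, n_modalities, dim)
    scores = np.einsum("nd,nmd->nm", id_emb, stacked) / np.sqrt(id_emb.shape[1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    fused = np.einsum("nm,nmd->nd", weights, stacked)
    return fused, weights


# Hypothetical toy data: random vectors stand in for real image/text features.
rng = np.random.default_rng(0)
n_items, dim = 6, 8
visual = rng.normal(size=(n_items, dim))
textual = rng.normal(size=(n_items, dim))
id_emb = rng.normal(size=(n_items, dim))

# One propagation step over each modality's item-item graph, then fusion.
visual_h = knn_item_graph(visual, k=2) @ visual
textual_h = knn_item_graph(textual, k=2) @ textual
fused, weights = attention_fuse(id_emb, [visual_h, textual_h])
print(fused.shape, weights.sum(axis=1))
```

In this sketch the attention weights are computed per item, so items whose ID embedding aligns more closely with one modality lean on that modality in the fused representation. The multi-step fusion and contrastive learning components mentioned in the abstract are not covered here.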