With the development of multimedia applications, multimodal recommendation plays an essential role, as it can leverage rich contexts beyond user-item interactions. Existing methods mainly use multimodal content to help learn ID features; however, there is a semantic gap between multimodal content features and ID features. Directly using multimodal information as an auxiliary signal thus leads to misalignment between item and user representations. In this paper, we first systematically investigate this misalignment issue in multimodal recommendation and propose a solution named AlignRec. In AlignRec, the recommendation objective is decomposed into three alignments: alignment within content modalities, alignment between content and categorical ID features, and alignment between users and items. Each alignment is characterized by a distinct objective function. To train AlignRec effectively, we propose first pre-training the first alignment to obtain unified multimodal features, and then training the remaining two alignments jointly. Since it is essential to analyze whether each multimodal feature helps training, we design three new classes of metrics to evaluate intermediate performance. Extensive experiments on three real-world datasets consistently verify the superiority of AlignRec over nine baselines. We also find that the multimodal features generated by our framework outperform those currently in use, and we will open-source them.