Boosting Multimedia Recommendation via Separate Generic and Unique Awareness

Multimedia recommendation, which incorporates multiple modalities (e.g., images and text) into user or item representations to improve recommendation quality, has received widespread attention. Recent methods mainly focus on cross-modal alignment with self-supervised learning to obtain higher-quality representations. Despite their remarkable performance, we argue that a limitation remains: completely aligning representations undermines modality-unique information. Cross-modal alignment is sound, but it should not be the whole objective, because different modalities share generic information while each modality also carries unique information. Simply aligning the modalities may ignore modality-unique features and thus degrade the performance of multimedia recommendation. To address this limitation, we propose a Separate Alignment aNd Distancing framework (SAND) for multimedia recommendation, which concurrently learns both modality-unique and modality-generic representations to achieve more comprehensive item representations. First, we split each modal feature into a generic part and a unique part. Then, in the alignment module, to better integrate semantic information across modalities, we design a SoloSimLoss to align the generic representations. Furthermore, in the distancing module, we push each modality's unique representation away from the modality-generic representation so that each modality retains its unique and complementary information. Because the framework is flexible, we provide two technical solutions: the more capable mutual information minimization and the simpler negative L2 distance. Finally, extensive experimental results on three popular datasets demonstrate the effectiveness and generalization of the proposed framework.
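
The split, alignment, and distancing steps described above can be illustrated with a minimal sketch. It makes several assumptions that the abstract does not confirm: the feature is split evenly by index, a plain cosine-based loss stands in for SoloSimLoss (whose definition is not given here), and only the negative L2 variant of distancing is shown. All function names are hypothetical and chosen for illustration only.

```python
import math


def split_feature(feature):
    """Split one modality feature into (generic, unique) halves.

    Assumption: an even split by index; the abstract does not say how
    the split is actually performed.
    """
    half = len(feature) // 2
    return feature[:half], feature[half:]


def cosine_alignment_loss(a, b):
    """Stand-in alignment loss: 1 - cosine similarity.

    This is NOT SoloSimLoss; it only illustrates pulling the generic
    parts of two modalities together.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def negative_l2_distance(unique, generic):
    """Distancing loss: minimizing it pushes unique away from generic."""
    return -math.sqrt(sum((u - g) ** 2 for u, g in zip(unique, generic)))


def combined_loss(image_feat, text_feat, weight=1.0):
    """Alignment on generic parts plus weighted distancing per modality.

    The additive form and the `weight` parameter are assumptions.
    """
    img_gen, img_uni = split_feature(image_feat)
    txt_gen, txt_uni = split_feature(text_feat)
    align = cosine_alignment_loss(img_gen, txt_gen)
    distance = (negative_l2_distance(img_uni, img_gen)
                + negative_l2_distance(txt_uni, txt_gen))
    return align + weight * distance


# Toy features for one item: the first half is treated as generic,
# the second half as unique.
image_feat = [1.0, 0.0, 4.0, 4.0]
text_feat = [2.0, 0.0, 2.0, 0.0]
total = combined_loss(image_feat, text_feat)
print(total)
```

In this toy case the generic halves point in the same direction, so the alignment term is zero, and the total is driven by how far each unique half lies from its generic half. A real model would learn the split and apply the losses to batches of learned embeddings rather than fixed lists.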
