Multimedia recommendation incorporates multiple modalities (e.g., images and text) into user and item representations to improve recommendation quality, and self-supervised learning has lifted it to a new level of performance owing to its strength in aligning different modalities. However, a growing body of research finds that aligning all modal representations is suboptimal because it damages the unique attributes of each modality. These studies learn the unique parts with subtraction and orthogonality constraints in geometric space. Our rigorous analysis, however, reveals flaws in this approach: subtraction does not necessarily yield the desired modal-unique features, and orthogonality constraints are ineffective in the high-dimensional representation spaces of users and items. To address these weaknesses, we propose Separate Learning (SEA) for multimedia recommendation, which learns modal-unique and modal-generic features from a mutual-information perspective. Specifically, we first use a GNN to learn user and item representations in each modality and split every modal representation into a generic part and a unique part. We employ the contrastive log-ratio upper bound (CLUB) to minimize the mutual information between the generic and unique parts within the same modality, pushing their representations apart and thereby learning modal-unique features. We then design Solosimloss to maximize a lower bound of mutual information, aligning the generic parts across modalities and thereby learning higher-quality modal-generic features. Finally, extensive experiments on three datasets demonstrate the effectiveness and generalizability of the proposed framework. The code and the full training record of the main experiment are available at SEA.
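To make the MI-minimization step concrete, the following is a minimal NumPy sketch of the CLUB estimator applied to the generic and unique parts of one modality. It assumes a unit-variance Gaussian variational distribution q(unique | generic) whose mean is a linear map; in practice this map would be a learned network, and all variable names here are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the contrastive log-ratio upper bound (CLUB):
#   I_CLUB = E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)]
# Minimising this estimate w.r.t. the encoders pushes the generic
# and unique parts of the same modality apart.
import numpy as np

rng = np.random.default_rng(0)
n, d = 16, 8                         # batch size, embedding dimension

generic = rng.normal(size=(n, d))    # modal-generic part of the embeddings
unique = rng.normal(size=(n, d))     # modal-unique part of the embeddings
W = rng.normal(size=(d, d)) * 0.1    # stand-in for the learned mean network mu(x)

def club_estimate(x, y, W):
    """CLUB upper bound with q(y|x) = N(y; x @ W, I) (constants cancel)."""
    mu = x @ W                                       # predicted mean of y given x
    positive = -0.5 * ((y - mu) ** 2).sum(axis=1)    # log q(y_i | x_i) for joint pairs
    # Marginal term: average log q(y_j | x_i) over all (i, j) pairs.
    diff = y[None, :, :] - mu[:, None, :]            # (n, n, d): y_j - mu_i
    negative = -0.5 * (diff ** 2).sum(axis=2).mean(axis=1)
    return (positive - negative).mean()

mi_upper_bound = club_estimate(generic, unique, W)
print(float(mi_upper_bound))
```

In training, this scalar would be added to the loss so that gradient descent shrinks the estimated mutual information between the two parts.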
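The alignment step maximizes a lower bound on the mutual information between the generic parts of different modalities. The paper's Solosimloss is not reproduced here; as an assumed stand-in, the sketch below uses the standard InfoNCE lower bound, whose minimization aligns matched visual/textual generic parts while separating mismatched ones. All names and the synthetic data are illustrative.

```python
# Illustrative InfoNCE lower bound on MI between the generic parts of two
# modalities (a common stand-in; the paper's actual Solosimloss may differ).
import numpy as np

rng = np.random.default_rng(1)
n, d = 16, 8
vis_generic = rng.normal(size=(n, d))                      # visual generic part
txt_generic = vis_generic + 0.1 * rng.normal(size=(n, d))  # correlated textual counterpart

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def info_nce(a, b, tau=0.2):
    """-E[log softmax_j(sim(a_i, b_j)/tau) at j=i]; lower loss = tighter MI bound."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = (a @ b.T) / tau                               # pairwise cosine similarities
    # Row-wise log-softmax; the diagonal entries are the matched (positive) pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

alignment_loss = info_nce(vis_generic, txt_generic)
print(float(alignment_loss))
```

Minimizing this loss maximizes the MI lower bound, pulling the two modalities' generic parts toward a shared representation.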