With the rapid development of online multimedia services, especially on e-commerce platforms, there is a pressing need for personalised recommendation systems that can effectively encode the diverse multi-modal content associated with each item. However, we argue that existing multi-modal recommender systems typically use isolated processes for both feature extraction and modality modelling. Such isolated processes can harm the recommendation performance. First, an isolated extraction process underestimates the importance of effective feature extraction in multi-modal recommendation, potentially incorporating non-relevant information, which is harmful to the item representations. Second, an isolated modality modelling process produces disjointed embeddings for the item modalities, since each modality is processed individually, which leads to a suboptimal fusion of the user/item representations and, in turn, less effective user preference prediction. We hypothesise that using a unified model to address both of these isolated processes will enable the consistent extraction and cohesive fusion of joint multi-modal features, thereby enhancing the effectiveness of multi-modal recommender systems. In this paper, we propose a novel model, called the Unified Multi-modal Graph Transformer (UGT), which first leverages a multi-way transformer to extract aligned multi-modal features from the raw data for top-k recommendation. Subsequently, we build a unified graph neural network in our UGT model to jointly fuse the user/item representations with their corresponding multi-modal features. Using the graph transformer architecture of our UGT model, we show that UGT achieves significant effectiveness gains, especially when jointly optimised with the commonly-used multi-modal recommendation losses.