Cross-Modal Attention Network with Dual Graph Learning in Multimodal Recommendation

Multimedia recommendation systems leverage user-item interactions and multimodal information to capture user preferences, enabling more accurate and personalized recommendations. Despite notable advances, existing approaches still face two critical limitations. First, shallow modality fusion often relies on simple concatenation and fails to exploit the rich synergistic intra- and inter-modal relationships. Second, asymmetric feature treatment, in which users are characterized only by interaction IDs while items benefit from rich multimodal content, hinders the learning of a shared semantic space. To address these issues, we propose a Cross-modal Recursive Attention Network with dual graph Embedding (CRANE). To tackle shallow fusion, we design a core Recursive Cross-Modal Attention (RCA) mechanism that iteratively refines modality features based on cross-correlations in a joint latent space, effectively capturing high-order intra- and inter-modal dependencies. For symmetric multimodal learning, we explicitly construct users' multimodal profiles by aggregating the features of their interacted items. Furthermore, CRANE integrates a symmetric dual-graph framework, comprising a heterogeneous user-item interaction graph and a homogeneous item-item semantic graph, unified by a self-supervised contrastive learning objective that fuses behavioral and semantic signals. Despite this modeling complexity, CRANE remains computationally efficient: theoretical and empirical analyses confirm its scalability, with faster convergence on small datasets and higher performance ceilings on large-scale ones. Comprehensive experiments on four public real-world datasets show an average 5% improvement in key metrics over state-of-the-art baselines.
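
As an illustration of the user-profile construction described in the abstract, the following is a minimal sketch of building a user's multimodal profile by aggregating the feature vectors of the items that user interacted with. The abstract does not specify the aggregation operator, so mean pooling is assumed here; the function name `build_user_profiles` and the toy data are hypothetical and are not taken from CRANE itself.

```python
from collections import defaultdict


def build_user_profiles(interactions, item_features):
    """Average the feature vectors of each user's interacted items.

    Mean pooling is an assumption; the aggregation could equally be a
    sum or a weighted combination.
    """
    sums = {}
    counts = defaultdict(int)
    for user, item in interactions:
        vec = item_features[item]
        if user not in sums:
            sums[user] = [0.0] * len(vec)
        # Accumulate the item's features into the user's running total.
        sums[user] = [s + v for s, v in zip(sums[user], vec)]
        counts[user] += 1
    # Divide each running total by the number of interactions.
    return {u: [s / counts[u] for s in total] for u, total in sums.items()}


# Hypothetical toy data: one modality, two-dimensional item features.
item_features = {"i1": [1.0, 0.0], "i2": [0.0, 1.0], "i3": [2.0, 2.0]}
interactions = [("u1", "i1"), ("u1", "i2"), ("u2", "i3")]
profiles = build_user_profiles(interactions, item_features)
```

In a multimodal setting, the same pooling would presumably be applied once per modality (for example, visual and textual features), so that users and items are described in the same feature spaces.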
