RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment

Multimodal recommendation systems typically integrate user behavior with multimodal item data to capture user preferences more accurately. Concurrently, with the rise of large models (LMs), multimodal recommendation increasingly leverages their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, whereas recommendation models rely heavily on sparse, unique user/item identity (ID) features. Existing works overlook this fundamental representational divergence between large models and recommendation systems, which results in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM-enhanced multimodal recommendation that offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item-item, user-item, and user-user relationships, leveraging user/item LM representations and interaction history. We then design a dual-granularity progressive multimodality-ID alignment framework, which achieves instance-level and distribution-level semantic alignment via cross-modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights. Additionally, deployment on a large-scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code is available at https://github.com/6lyc/RecGOAT-LLM4Rec.
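To make the two alignment granularities concrete, the following is a minimal NumPy sketch, not the released RecGOAT implementation. It assumes a symmetric InfoNCE loss as a stand-in for CMCL (instance level) and a plain entropic Sinkhorn solver as a stand-in for OAT (distribution level); the adaptive component of OAT is not specified in the abstract and is not modeled here. The function names, the temperature, the regularization strength `epsilon`, and the weight `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def info_nce(id_emb, mm_emb, temperature=0.2):
    """Instance-level alignment (assumed form): symmetric InfoNCE.

    Row i of id_emb and row i of mm_emb are treated as the positive pair;
    every other row in the batch acts as a negative.
    """
    a, b = l2_normalize(id_emb), l2_normalize(mm_emb)
    logits = a @ b.T / temperature

    def diagonal_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def sinkhorn_cost(x, y, epsilon=0.1, n_iters=200):
    """Distribution-level alignment (assumed form): entropic optimal transport.

    Uses uniform weights on both point clouds and a cosine-distance cost.
    Returns the transport cost and the transport plan.
    """
    x, y = l2_normalize(x), l2_normalize(y)
    cost = 1.0 - x @ y.T
    n, m = cost.shape
    mu = np.full(n, 1.0 / n)
    nu = np.full(m, 1.0 / m)
    kernel = np.exp(-cost / epsilon)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    plan = u[:, None] * kernel * v[None, :]
    return float((plan * cost).sum()), plan


def alignment_loss(id_emb, mm_emb, lam=0.5):
    # Hypothetical combination: a weighted sum of the two terms.
    ot_cost, _ = sinkhorn_cost(id_emb, mm_emb)
    return info_nce(id_emb, mm_emb) + lam * ot_cost
```

In this sketch the contrastive term pulls each item's ID embedding toward its own multimodal embedding, while the transport term compares the two embedding sets as whole distributions. A fixed weighted sum is used only for simplicity; the progressive schedule mentioned above may combine the two objectives differently.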
