Multimodal recommendation has emerged as a mainstream paradigm, typically leveraging text and visual embeddings extracted from pre-trained models such as Sentence-BERT, Vision Transformers, and ResNet. The paradigm rests on the intuitive assumption that incorporating multimodal embeddings improves recommendation performance. Despite its popularity, however, this assumption lacks comprehensive empirical verification, leaving a critical research gap. To address it, we pose the central question of this paper: are multimodal embeddings truly beneficial for recommendation? To answer it, we conduct a large-scale empirical study examining the role of text and visual embeddings in modern multimodal recommendation models, both jointly and individually. Specifically, we investigate two research questions: (1) Do multimodal embeddings as a whole improve recommendation performance? (2) Is each individual modality (text and image) useful on its own? To isolate the effect of a single modality (text or visual), we employ a modality knockout strategy that replaces the corresponding embeddings with either constant values or random noise. To ensure the scale and comprehensiveness of the study, we evaluate 14 widely used state-of-the-art multimodal recommendation models. Our findings reveal that: (1) multimodal embeddings generally enhance recommendation performance, particularly when integrated through more sophisticated graph-based fusion models; surprisingly, commonly adopted baselines with simple fusion schemes, such as VBPR and BM3, show only limited gains. (2) The text modality alone achieves performance comparable to the full multimodal setting in most cases, whereas the image modality alone does not. These results offer foundational insights and practical guidance for the multimodal recommendation community.
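To make the modality knockout strategy concrete, the following is a minimal sketch, assuming item embeddings are pre-extracted and stored as NumPy arrays; the function name `knock_out_modality`, the embedding dimensions, and the synthetic data are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

def knock_out_modality(features: np.ndarray, mode: str = "constant",
                       constant_value: float = 0.0, seed: int = 42) -> np.ndarray:
    """Remove a modality's information by replacing its item embeddings.

    features: (num_items, dim) embeddings extracted from a pre-trained encoder
              (e.g., Sentence-BERT for text, a Vision Transformer for images).
    mode:     "constant" sets every embedding to the same constant value;
              "noise" replaces embeddings with random Gaussian vectors.
    """
    if mode == "constant":
        return np.full_like(features, constant_value)
    if mode == "noise":
        rng = np.random.default_rng(seed)
        # Match the scale of the original embeddings so downstream
        # normalization layers behave comparably.
        return rng.normal(loc=features.mean(), scale=features.std(),
                          size=features.shape).astype(features.dtype)
    raise ValueError(f"unknown knockout mode: {mode}")

# Example: knock out the visual modality while keeping text intact.
# Synthetic embeddings stand in for features from pre-trained encoders.
text_emb = np.random.rand(1000, 384).astype(np.float32)   # e.g., text embedding dim
image_emb = np.random.rand(1000, 768).astype(np.float32)  # e.g., image embedding dim
image_emb_knocked = knock_out_modality(image_emb, mode="noise")
# Train the multimodal recommender on (text_emb, image_emb_knocked) as usual;
# any performance drop can then be attributed to the missing visual signal.
```

The knocked-out embeddings keep the original shape and dtype, so the recommendation model's architecture and training loop stay unchanged, and only the informational content of the targeted modality is removed.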