Multimodal recommendation is commonly framed as a feature fusion problem, where textual and visual signals are combined to better model user preference. However, the effectiveness of multimodal recommendation may depend not only on how modalities are fused, but also on whether item content is represented in a semantic space aligned with preference matching. This issue is particularly important because raw visual features often preserve appearance similarity, while user decisions are typically driven by higher-level semantic factors such as style, material, and usage context. Motivated by this observation, we propose LVLM-grounded Multimodal Semantic Representation for Recommendation (VLM4Rec), a lightweight framework that organizes multimodal item content through semantic alignment rather than direct feature fusion. VLM4Rec first uses a large vision-language model to ground each item image into an explicit natural-language description, and then encodes the grounded semantics into dense item representations for preference-oriented retrieval. Recommendation is subsequently performed through a simple profile-based semantic matching mechanism over historical item embeddings, yielding a practical offline-online decomposition. Extensive experiments on multiple multimodal recommendation datasets show that VLM4Rec consistently improves performance over raw visual features and several fusion-based alternatives, suggesting that representation quality may matter more than fusion complexity in this setting. The code is released at https://github.com/tyvalencia/enhancing-mm-rec-sys.
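The profile-based semantic matching described above can be sketched as follows. This is a minimal illustration, not the paper's exact method: it assumes the user profile is a mean-pooled, L2-normalized aggregate of historical item embeddings and that matching uses cosine similarity; the actual aggregation in VLM4Rec may differ. The item embeddings here are random stand-ins for vectors that would, in the framework, be produced offline by encoding the VLM-grounded natural-language descriptions.

```python
import numpy as np

def l2_normalize(m: np.ndarray) -> np.ndarray:
    """Normalize rows (or a single vector) to unit L2 norm."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def build_user_profile(hist_item_embs: np.ndarray) -> np.ndarray:
    """Offline step (sketch): aggregate a user's historical item
    embeddings into one profile vector. Mean pooling is an
    assumption made for illustration."""
    return l2_normalize(hist_item_embs.mean(axis=0))

def recommend(profile: np.ndarray, candidate_embs: np.ndarray, k: int) -> np.ndarray:
    """Online step (sketch): rank candidates by cosine similarity
    to the profile and return the top-k item indices."""
    scores = l2_normalize(candidate_embs) @ profile
    return np.argsort(-scores)[:k]

# Toy example with random stand-in embeddings (hypothetical data).
rng = np.random.default_rng(0)
item_embs = l2_normalize(rng.normal(size=(100, 16)))   # 100 items, dim 16
history = item_embs[[3, 7, 42]]                        # a user's past items
profile = build_user_profile(history)
top5 = recommend(profile, item_embs, k=5)
```

The offline-online decomposition mentioned in the abstract maps naturally onto this split: grounding images with the VLM, encoding descriptions, and building profiles can all happen offline, leaving only a cheap dot-product ranking at serving time.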