RAG-VisualRec: An Open Resource for Vision- and Text-Enhanced Retrieval-Augmented Generation in Recommendation

This paper addresses the challenge of building multimodal recommender systems for the movie domain, where sparse item metadata (e.g., title and genres) can limit retrieval quality and downstream recommendations. We introduce RAG-VisualRec, an open resource and reproducible pipeline that combines (i) LLM-generated item-side plot descriptions and (ii) trailer-derived visual (and optional audio) embeddings, supporting both retrieval-augmented generation (RAG) and collaborative-filtering-style workflows. Our pipeline augments sparse metadata into richer textual signals and integrates modalities via configurable fusion strategies (e.g., PCA and CCA) before retrieval and optional LLM-based re-ranking. Beyond the resource itself, we contribute a complementary analysis that improves transparency and reproducibility. In particular, we introduce LLMGenQC, a critic-based quality-control module (LLM-as-judge) that audits synthetic synopses for semantic alignment with metadata, consistency, safety, and basic sanity, and we release its critic scores and pass/fail labels alongside the generated artifacts. We report ablation studies that quantify the impact of key design choices, including retrieval depth, fusion strategy, and user-embedding construction. Across experiments, CCA-based fusion consistently improves recall over unimodal baselines, while LLM-based re-ranking typically improves nDCG by refining top-K selection from the retrieved candidate pool, especially when textual evidence is limited. By releasing RAG-VisualRec, we enable further research on multimodal RAG recommenders, quality auditing of LLM-generated side information, and long-tail-oriented evaluation protocols. All code, data, and detailed documentation are publicly available at: https://github.com/RecSys-lab/RAG-VisualRec.
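To make the two reported metrics concrete, the following is a minimal sketch of Recall@K and binary-relevance nDCG@K. It is not taken from the RAG-VisualRec codebase: the function names, the item identifiers, and the toy rankings are all hypothetical, and the resource's own evaluation code may differ (for example, by using graded relevance). The sketch illustrates why re-ranking a fixed candidate pool can raise nDCG while leaving recall unchanged.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance nDCG@k: DCG of `ranked` divided by the ideal DCG."""
    # Position 0 gets discount 1/log2(2) = 1, position 1 gets 1/log2(3), etc.
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # The ideal ranking places all relevant items (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical toy data: the same four candidates before and after re-ranking.
retrieved = ["m3", "m7", "m1", "m9"]   # order returned by the retriever
reranked = ["m1", "m3", "m7", "m9"]    # order after an assumed re-ranking step
relevant = {"m1", "m9"}                # held-out items the user liked

recall_before = recall_at_k(retrieved, relevant, 4)
recall_after = recall_at_k(reranked, relevant, 4)
ndcg_before = ndcg_at_k(retrieved, relevant, 4)
ndcg_after = ndcg_at_k(reranked, relevant, 4)
```

In this toy case both orderings contain the same relevant items within the top 4, so recall is identical, whereas the re-ranked list moves a relevant item to the first position and therefore scores a higher nDCG. Recall depends only on which candidates the retriever returns, while nDCG also rewards how they are ordered.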
