Multimodal sequential recommendation (MSR) incorporates textual and visual information to improve recommendation quality. However, recent studies and our empirical analysis show that visual features are often underutilized and contribute far less than textual signals. We attribute this issue to two factors: insufficient visual representation learning (pretrained encoders fail to capture preference-relevant cues) and unbalanced visual-text optimization (textual features dominate the learning process). To address both, we propose Teach Multimodal Recommendation Model to See via Personalized Visual Extraction and Adaptive Learning (REVEAL), a plug-and-play framework that enhances visual representation learning and cross-modal optimization without modifying the original recommendation backbone. REVEAL consists of two components: Feedback-Guided Visual Extraction (FVE), which refines prompt-guided visual extraction through task-level feedback, and Adaptive Visual Learning (AVL), which dynamically reweights visual learning to alleviate modality imbalance. Experiments on multiple real-world datasets and MSR backbones show that REVEAL consistently improves recommendation performance. Further analysis shows that these gains arise from more effective attention to preference-relevant visual regions and better visual utilization during training. The code is available at https://github.com/YutongLi2024/REVEAL.
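To make the idea of dynamically reweighting visual learning concrete, the following is a minimal illustrative sketch and not the AVL module itself. The abstract does not specify the weighting rule, so the sigmoid-of-loss-gap rule, the function names `adaptive_visual_weight` and `combined_loss`, and the `temperature` parameter are all assumptions made for illustration.

```python
import math


def adaptive_visual_weight(text_loss: float, visual_loss: float,
                           temperature: float = 1.0) -> float:
    """Hypothetical rule: weight the visual term more when it lags behind text.

    Returns a value in (0, 1). It equals 0.5 when both losses match and
    grows toward 1 as the visual loss exceeds the text loss.
    """
    gap = (visual_loss - text_loss) / temperature
    return 1.0 / (1.0 + math.exp(-gap))


def combined_loss(text_loss: float, visual_loss: float,
                  temperature: float = 1.0) -> float:
    """Convex combination of the two modality losses (assumed form)."""
    w_v = adaptive_visual_weight(text_loss, visual_loss, temperature)
    return (1.0 - w_v) * text_loss + w_v * visual_loss


if __name__ == "__main__":
    # The visual branch is lagging (higher loss), so it receives more weight.
    print(adaptive_visual_weight(text_loss=0.5, visual_loss=2.0))
    print(combined_loss(text_loss=0.5, visual_loss=2.0))
```

In this sketch, a lower `temperature` makes the weight react more sharply to the gap between the two losses. The rule actually used by REVEAL may differ, and the linked repository is the authoritative reference.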