Sequential Recommendation (SR) in multimodal settings typically relies on small frozen pretrained encoders, which limits semantic capacity and prevents Collaborative Filtering (CF) signals from being fully integrated into item representations. Inspired by the recent success of Large Language Models (LLMs) as high-capacity embedders, we investigate the use of Vision-Language Models (VLMs) as CF-aware multimodal encoders for SR. However, we find that standard contrastive supervised fine-tuning (SFT), which adapts VLMs for embedding generation and injects CF signals, can amplify the modality collapse inherent to VLMs: optimization becomes dominated by a single modality while the other degrades, ultimately undermining recommendation accuracy. To address this, we propose VLM2Rec, a VLM-embedder-based framework for multimodal sequential recommendation designed to ensure balanced modality utilization. Specifically, we introduce Weak-modality Penalized Contrastive Learning to rectify gradient imbalance during optimization, and Cross-Modal Relational Topology Regularization to preserve geometric consistency between modalities. Extensive experiments demonstrate that VLM2Rec consistently outperforms state-of-the-art baselines in both accuracy and robustness across diverse scenarios.
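The abstract names the two training objectives without giving their formulations, so the following is a minimal PyTorch sketch of one plausible reading: a weak-modality penalty that re-weights each modality's in-batch contrastive (InfoNCE) loss by its relative magnitude, plus a relational topology term that matches in-batch similarity matrices across modalities. All function names, the weighting rule, and the coefficients (`alpha`, the `0.1` regularizer weight) are illustrative assumptions, not VLM2Rec's actual definitions.

```python
# Hedged sketch of the two objectives named in the abstract. The penalty
# rule (re-weighting the weaker modality's InfoNCE term by its relative
# loss) and the topology term (matching in-batch similarity matrices
# across modalities) are assumptions, not the paper's definitions.
import torch
import torch.nn.functional as F


def info_nce(query: torch.Tensor, target: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """In-batch contrastive loss: the i-th query should match the i-th target."""
    logits = F.normalize(query, dim=-1) @ F.normalize(target, dim=-1).T / tau
    labels = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, labels)


def weak_modality_penalized_loss(z_text, z_image, z_target, alpha: float = 1.0):
    """Assumed penalty: upweight whichever modality currently has the larger
    (worse) contrastive loss, so its gradients are not drowned out by the
    dominant modality."""
    l_t = info_nce(z_text, z_target)
    l_v = info_nce(z_image, z_target)
    with torch.no_grad():  # the weights are treated as constants
        w_t = (l_t / (l_t + l_v)) ** alpha
        w_v = (l_v / (l_t + l_v)) ** alpha
    return 2.0 * (w_t * l_t + w_v * l_v)


def topology_regularizer(z_text, z_image):
    """Assumed cross-modal relational term: the in-batch cosine-similarity
    structure of one modality should match the other's."""
    s_t = F.normalize(z_text, dim=-1) @ F.normalize(z_text, dim=-1).T
    s_v = F.normalize(z_image, dim=-1) @ F.normalize(z_image, dim=-1).T
    return F.mse_loss(s_t, s_v)


# Usage with random stand-in embeddings (batch of 8, dim 32):
z_text, z_image, z_target = (torch.randn(8, 32, requires_grad=True) for _ in range(3))
loss = weak_modality_penalized_loss(z_text, z_image, z_target) \
     + 0.1 * topology_regularizer(z_text, z_image)
loss.backward()
```

Detaching the weights via `torch.no_grad()` keeps the penalty from distorting the contrastive gradient direction itself; only the per-modality gradient magnitudes are rebalanced, which is one common way such re-weighting schemes are implemented.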