Multimodal large language models (MLLMs) are pushing recommender systems (RecSys) toward content-grounded retrieval and ranking via cross-modal fusion. We find that while cross-modal consensus often mitigates conventional poisoning, which manipulates interaction logs or perturbs a single modality, fusion also introduces a new attack surface: synchronised multimodal poisoning can reliably steer fused representations along stable semantic directions during fine-tuning. To characterise this threat, we formalise cross-modal interactive poisoning and propose VENOMREC, which performs Exposure Alignment to identify high-exposure regions in the joint embedding space and Cross-modal Interactive Perturbation to craft attention-guided, coupled token-patch edits. Experiments on three real-world multimodal datasets show that VENOMREC consistently outperforms strong baselines, achieving a mean ER@20 of 0.73 and improving over the strongest baseline by +0.52 absolute ER points on average, while maintaining comparable recommendation utility.
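To make the Cross-modal Interactive Perturbation step more concrete, the sketch below illustrates one attention-guided, coupled token-patch edit that nudges a fused representation toward a target semantic direction. This is a minimal illustration under stated assumptions, not the VENOMREC implementation: the `fuse` callable, the step sizes `eps_t` and `eps_p`, the top-k attention guidance, and the single FGSM-style sign step are all placeholders for details the abstract does not specify.

```python
import torch
import torch.nn.functional as F

def coupled_perturbation(text_emb, patch_emb, attn_text, attn_patch,
                         target_dir, fuse, eps_t=0.05, eps_p=8 / 255, k=4):
    """One hypothetical coupled token-patch perturbation step.

    text_emb:   (T, d) continuous text-token embeddings (assumed editable)
    patch_emb:  (P, d) image-patch embeddings
    attn_text:  (T,) cross-modal attention scores over text tokens
    attn_patch: (P,) cross-modal attention scores over image patches
    target_dir: (d,) semantic direction to steer the fused vector toward
    fuse:       assumed callable mapping (text_emb, patch_emb) -> (d,)
    """
    text_emb = text_emb.clone().detach().requires_grad_(True)
    patch_emb = patch_emb.clone().detach().requires_grad_(True)

    # Objective: increase alignment of the fused vector with target_dir.
    fused = fuse(text_emb, patch_emb)
    loss = F.cosine_similarity(fused, target_dir, dim=0)
    loss.backward()

    # Attention guidance: only the top-k most attended tokens/patches move.
    t_mask = torch.zeros(text_emb.shape[0], dtype=torch.bool)
    t_mask[attn_text.topk(min(k, len(attn_text))).indices] = True
    p_mask = torch.zeros(patch_emb.shape[0], dtype=torch.bool)
    p_mask[attn_patch.topk(min(k, len(attn_patch))).indices] = True

    # Synchronised sign-ascent edits on both modalities (FGSM-style).
    with torch.no_grad():
        text_emb[t_mask] += eps_t * text_emb.grad[t_mask].sign()
        patch_emb[p_mask] += eps_p * patch_emb.grad[p_mask].sign()
    return text_emb.detach(), patch_emb.detach()
```

A full attack in the spirit of the abstract would iterate this step, project token-embedding edits back to discrete vocabulary items, and keep patch perturbations within a valid pixel budget; none of those components is shown here.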