Multimodal Large Language Models with Adaptive Preference Optimization for Sequential Recommendation

Recent advances in Large Language Models (LLMs) have opened new avenues for sequential recommendation by enabling natural language reasoning over user behavior sequences. A common approach formulates recommendation as a language modeling task: interaction histories are transformed into prompts, and user preferences are learned via supervised fine-tuning. However, these methods operate solely in the textual modality and often miss users' fine-grained interests, especially when those interests are shaped by rich visual signals such as product images or movie posters. Multimodal Large Language Models (MLLMs) offer a promising alternative by aligning text and vision in a shared semantic space. A prevalent training paradigm applies Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO) to model user preferences. Yet two core challenges remain: (1) imbalanced sample hardness, where random negative sampling causes overfitting on easy examples and under-training on hard ones; and (2) cross-modal semantic bias, where the fixed reference model in DPO prevents the policy model from correcting modality misalignments, especially over long sequences. To address these issues, we propose HaNoRec, a multimodal LLM framework that integrates Hardness-aware and Noise-regularized preference optimization for Recommendation. Specifically, HaNoRec dynamically adjusts optimization weights based on both the estimated hardness of each training sample and the policy model's real-time responsiveness, so that harder examples are prioritized. It further introduces Gaussian-perturbed distribution optimization on the output logits to enhance cross-modal semantic consistency and reduce the modality bias inherited from the reference model.
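
To make the two ideas above concrete, the following is a minimal illustrative sketch, not the HaNoRec objective itself, whose exact form the abstract does not give. It assumes the standard DPO implicit-reward margin, a hypothetical hardness weight that grows as that margin shrinks, and a hypothetical consistency term defined as the KL divergence between the softmax of clean logits and of Gaussian-perturbed logits. All function names and the parameters `beta`, `gamma`, and `sigma` are assumptions introduced only for illustration.

```python
import math
import random


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_margin(pol_pos, pol_neg, ref_pos, ref_neg, beta=0.1):
    """Standard DPO margin from sequence log-probabilities of the chosen
    (pos) and rejected (neg) items under the policy and reference models."""
    return beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))


def hardness_weight(margin, gamma=1.0):
    """Hypothetical weight: approaches 1 + gamma for hard pairs (small or
    negative margin) and 1 for easy pairs (large positive margin)."""
    return 1.0 + gamma * math.exp(log_sigmoid(-margin))


def weighted_dpo_loss(batch, beta=0.1, gamma=1.0):
    """Mean of weight * (-log sigmoid(margin)) over (pol_pos, pol_neg,
    ref_pos, ref_neg) tuples. In an autodiff framework the weight would
    typically be treated as a constant (no gradient); that is an assumption."""
    total = 0.0
    for pol_pos, pol_neg, ref_pos, ref_neg in batch:
        m = dpo_margin(pol_pos, pol_neg, ref_pos, ref_neg, beta)
        total += hardness_weight(m, gamma) * -log_sigmoid(m)
    return total / len(batch)


def softmax(logits):
    """Softmax with max-subtraction for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def noise_consistency(logits, sigma=0.1, rng=None):
    """Hypothetical regularizer: KL(softmax(logits) || softmax(logits + noise))
    with i.i.d. Gaussian noise of standard deviation sigma on each logit."""
    rng = rng or random.Random(0)
    noisy = [v + rng.gauss(0.0, sigma) for v in logits]
    p, q = softmax(logits), softmax(noisy)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In this sketch, a pair whose rejected item the policy still scores close to (or above) the chosen item receives a larger weight, so it contributes more to the loss than a pair the policy already separates well. The consistency term is zero when `sigma` is zero and grows as the perturbation changes the output distribution more; how such a term would be combined with the preference loss is left open here.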
