Group relative policy optimization (GRPO) has become a standard post-training paradigm for improving reasoning and preference alignment in large language models (LLMs), and has recently shown strong effectiveness in LLM-based recommender systems. However, extending GRPO-based reasoning pipelines to multimodal sequential recommendation (MSR) with multimodal large language models (MLLMs) faces fundamental obstacles. First, MSR requires jointly encoding visual content for both historical interactions and multiple candidate items, causing visual tokens to dominate the input and making the cost of group-based rollout scale with history length and candidate set size, which renders GRPO-based training prohibitively expensive. Second, existing Chain-of-Thought (CoT) supervision suffers from reward inflation in recommendation scenarios, where higher training rewards do not reliably translate into improved ranking performance and may induce shortcut learning. To address these challenges, we propose MLLMRec-R1, an efficient and stable GRPO-based reasoning framework for multimodal sequential recommendation. MLLMRec-R1 textualizes visual signals offline to eliminate expensive visual tokens while preserving multimodal semantics, and constructs high-quality multimodal CoT supervision through refinement and confidence-aware assessment. Furthermore, a mixed-grained data augmentation strategy selectively injects reliable CoT samples while retaining standard training data, mitigating reward inflation and improving generalization stability. Extensive experiments on three benchmark datasets demonstrate that MLLMRec-R1 consistently outperforms state-of-the-art methods, establishing a practical and effective GRPO-based reasoning pipeline for multimodal sequential recommendation. The code is available at https://github.com/wangyu0627/MLLMRec-R1.
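For readers unfamiliar with the mechanism the abstract refers to, the core of GRPO is normalizing each sampled rollout's reward against the statistics of its own group, so no learned value function is needed. The following is a minimal illustrative sketch of that group-relative advantage computation; the function name and example rewards are hypothetical and not taken from MLLMRec-R1.

```python
# Minimal sketch of the group-relative advantage used in GRPO.
# Names and values are illustrative, not from the MLLMRec-R1 codebase.
from statistics import mean, stdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its group's mean and std."""
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for a group of 4 rollouts sampled from one prompt.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.5])
```

Because every prompt requires a full group of rollouts, the cost grows with group size times per-rollout input length, which is why visual tokens over long histories and large candidate sets make GRPO training expensive in the MSR setting.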