Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

From arXiv, 31 pages

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
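The abstract reports Recall@5 and NDCG@5 and notes that a reranking reward can be gamed by simply preserving the input order. The sketch below is an illustration only, not the paper's implementation: `recall_at_k` and `ndcg_at_k` are the standard binary-relevance definitions, while `conditional_reward` is a hypothetical reward whose name, values, and conditions are assumptions chosen to show one way an "unchanged order" output could be denied credit unless the input was already ideal.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of the relevant items that appear in the top-k of `ranked`."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Binary-relevance NDCG@k: DCG of `ranked` divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def conditional_reward(candidates, reranked, relevant, k=5):
    """Hypothetical conditional reward (an assumption, not the paper's design).

    - Output that is not a permutation of the candidates gets a penalty.
    - Output that copies the input order is rewarded only when the input
      ranking was already ideal, so "do nothing" is not a free win.
    - Otherwise the reward is the NDCG@k gain over the input order.
    """
    if sorted(reranked) != sorted(candidates):
        return -1.0  # verifiable format check failed
    base = ndcg_at_k(candidates, relevant, k)
    if reranked == candidates:
        return 1.0 if math.isclose(base, 1.0) else 0.0
    return ndcg_at_k(reranked, relevant, k) - base


# Usage: the single relevant item starts outside the top 5.
candidates = ["a", "b", "c", "d", "e", "f"]
relevant = {"f"}
improved = ["f", "a", "b", "c", "d", "e"]

print(recall_at_k(candidates, relevant), recall_at_k(improved, relevant))
print(conditional_reward(candidates, improved, relevant))
print(conditional_reward(candidates, candidates, relevant))
```

In this toy setup, copying the input order earns nothing because the relevant item stays outside the top 5, whereas moving it to the front earns the full NDCG@5 gain. How GR2 actually conditions its rewards is described in the paper itself, not here.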

