Generative Reasoning Re-ranker

Mingfu Liang, Yufei Li, Jay Xu, Kavosh Asadi, Xi Liu, Shuo Gu, Kaushik Rangadurai, Frank Shyu, Shuaiwen Wang, Song Yang, Zhijing Li, Jiang Liu, Mengying Sun, Fei Tian, Xiaohan Wei, Chonglin Sun, Jacob Tao, Shike Mei, Wenlin Chen, Santanu Kolay, Sandeep Pandey, Hamed Firooz, Luke Simon

From arXiv, 31 pages

Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems due to their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, creating major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs encoded from non-semantic IDs via a tokenizer achieving $\ge$99% uniqueness. Next, a stronger larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling, which are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to exploit reward hacking by preserving item order, motivating conditional verifiable rewards to mitigate this behavior and optimize reranking performance.
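The abstract reports its gains in Recall@5 and NDCG@5. As a rough illustration of what those metrics measure for a reranked list, here is a minimal sketch using the common binary-relevance definitions. The function names, the item IDs, and the choice to normalize recall by the number of relevant items are assumptions made for this example; the paper's exact evaluation protocol is not given in the abstract.

```python
import math


def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant items that appear in the top-k of the ranked list.

    Assumes binary relevance and normalizes by the number of relevant items,
    which is one common convention (not necessarily the paper's).
    """
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k=5):
    """Normalized discounted cumulative gain at k with binary gains."""
    # DCG: each relevant item at 0-based position `pos` contributes 1 / log2(pos + 2).
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    # Ideal DCG: all relevant items (at most k of them) placed at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical reranked candidate list and a single ground-truth item.
reranked = ["item_a", "item_b", "item_c", "item_d", "item_e", "item_f"]
ground_truth = {"item_c"}

recall5 = recall_at_k(reranked, ground_truth, k=5)  # item_c is in the top 5
ndcg5 = ndcg_at_k(reranked, ground_truth, k=5)      # item_c sits at position 3
```

Recall@5 only checks whether the target lands in the top five, while NDCG@5 also rewards placing it higher. An unchanged candidate order and an improved one can therefore share the same Recall@5 but differ in NDCG@5.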

