Recent studies increasingly explore Large Language Models (LLMs) as a new paradigm for recommendation systems because of their scalability and world knowledge. However, existing work has three key limitations: (1) most efforts focus on retrieval and ranking, while the reranking phase, which is critical for refining final recommendations, is largely overlooked; (2) LLMs are typically used in zero-shot or supervised fine-tuning settings, leaving their reasoning abilities, especially those enhanced through reinforcement learning (RL) and high-quality reasoning data, underexploited; (3) items are commonly represented by non-semantic IDs, which creates major scalability challenges in industrial systems with billions of identifiers. To address these gaps, we propose the Generative Reasoning Reranker (GR2), an end-to-end framework with a three-stage training pipeline tailored for reranking. First, a pretrained LLM is mid-trained on semantic IDs, which a tokenizer encodes from non-semantic IDs with $\ge$99% uniqueness. Next, a stronger, larger-scale LLM generates high-quality reasoning traces through carefully designed prompting and rejection sampling; these traces are used for supervised fine-tuning to impart foundational reasoning skills. Finally, we apply Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO), enabling scalable RL supervision with verifiable rewards designed specifically for reranking. Experiments on two real-world datasets demonstrate GR2's effectiveness: it surpasses the state-of-the-art OneRec-Think by 2.4% in Recall@5 and 1.3% in NDCG@5. Ablations confirm that advanced reasoning traces yield substantial gains across metrics. We further find that RL reward design is crucial in reranking: LLMs tend to hack the reward by simply preserving the input item order, which motivates conditional verifiable rewards that mitigate this behavior and optimize reranking performance.
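To make the evaluation metrics and the reward-hacking concern concrete, the following is a minimal sketch, not the GR2 implementation. It assumes binary relevance and computes Recall@K and NDCG@K for a reranked list. The function `conditional_reward` is purely hypothetical: the abstract does not specify the form of the conditional verifiable reward, so the rule used here (no reward for copying a suboptimal input order, and no reward for outputs that are not a permutation of the candidates) is an illustrative assumption.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """Fraction of relevant items that appear in the top-k positions."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 5) -> float:
    """NDCG@k with binary gains (an assumption made for this sketch)."""
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def conditional_reward(
    candidates: Sequence[str],
    output: Sequence[str],
    relevant: Set[str],
    k: int = 5,
) -> float:
    """Hypothetical conditional reward; NOT the reward defined by GR2."""
    # Verifiable format check: the output must be a permutation of the input.
    if sorted(output) != sorted(candidates):
        return 0.0
    # Condition: echoing the input order earns nothing unless that order
    # is already optimal, which removes the incentive to copy the input.
    if list(output) == list(candidates) and ndcg_at_k(candidates, relevant, k) < 1.0:
        return 0.0
    return ndcg_at_k(output, relevant, k)


candidates = ["a", "b", "c", "d", "e", "f"]
relevant = {"c"}
reranked = ["c", "a", "b", "d", "e", "f"]

# A plain NDCG reward still pays 0.5 for copying the input order.
plain_identity = ndcg_at_k(candidates, relevant)
# The conditional variant pays 0.0 for the copy and 1.0 for the real rerank.
cond_identity = conditional_reward(candidates, candidates, relevant)
cond_reranked = conditional_reward(candidates, reranked, relevant)
```

In this toy case, a metric-only reward gives the identity permutation a nonzero score whenever the upstream ranker already places a relevant item in the top 5, which is one way a policy could learn to leave the order untouched. Conditioning the reward on an actual change is one possible countermeasure under the stated assumptions.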