Large language models (LLMs) have recently shown strong reasoning abilities in domains such as mathematics, coding, and scientific problem-solving, yet their potential for ranking tasks, such as retrieval, recommender systems, and LLM routing, remains underexplored. Ranking requires complex reasoning across heterogeneous candidates, but existing LLM-based rankers are often domain-specific, tied to fixed backbones, and lack iterative refinement, limiting their ability to fully exploit LLMs' reasoning potential. To address these challenges, we propose R1-Ranker, a reinforcement-learning framework that incentivizes reasoning, with two complementary designs: DRanker, which generates full rankings in one shot, and IRanker, which decomposes ranking into an iterative elimination process with step-wise rewards to encourage deeper reasoning. We evaluate unified R1-Rankers on nine datasets spanning recommendation, routing, and passage ranking, showing that IRanker-3B consistently achieves state-of-the-art performance, surpasses larger 7B models on some tasks, and yields a 15.7% average relative improvement. Ablation and generalization experiments further confirm the critical role of reinforcement learning and iterative reasoning: IRanker-3B improves zero-shot performance by over 9% on out-of-domain tasks, and its reasoning traces boost other LLMs by up to 22.87%. These results demonstrate that unifying diverse ranking tasks under a single reasoning-driven foundation model is both effective and essential for advancing LLM reasoning in ranking scenarios.
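The iterative elimination idea behind IRanker can be sketched as a simple loop: at each step the least promising candidate is judged and removed, and the final ranking is read off in reverse elimination order. The sketch below is illustrative only; the `score` callable is a hypothetical stand-in for the RL-trained LLM judgment described in the abstract.

```python
def iterative_elimination_rank(candidates, score):
    """Rank candidates best-first by repeatedly eliminating the lowest scorer.

    `score(candidate, remaining)` is a placeholder for an LLM-based judgment;
    it receives the candidate and the tuple of still-surviving candidates,
    mirroring the step-wise context the iterative scheme conditions on.
    """
    remaining = list(candidates)
    eliminated = []  # worst candidates, in order of elimination
    while len(remaining) > 1:
        # Judge each survivor in the context of the current pool,
        # then eliminate the one with the lowest score.
        worst = min(remaining, key=lambda c: score(c, tuple(remaining)))
        remaining.remove(worst)
        eliminated.append(worst)
    # The last survivor ranks first; earlier eliminations rank lower.
    return remaining + eliminated[::-1]
```

With a toy scorer such as `lambda c, pool: len(c)`, the longest string survives to the top, illustrating how per-step judgments compose into a full ranking.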