Large Language Model (LLM)-based listwise ranking has shown superior performance in many passage ranking tasks. With the development of Large Reasoning Models (LRMs), many studies have demonstrated that step-by-step reasoning at test time helps improve listwise ranking performance. However, due to the scarcity of reasoning-intensive training data, existing rerankers perform poorly in many complex ranking scenarios, and the ranking ability of reasoning-intensive rerankers remains largely underdeveloped. In this paper, we first propose an automated framework for synthesizing reasoning-intensive training data, which sources training queries and passages from diverse domains and applies DeepSeek-R1 to generate high-quality training labels. To empower the listwise reranker with strong reasoning ability, we further propose a two-stage training approach consisting of a cold-start supervised fine-tuning (SFT) stage and a reinforcement learning (RL) stage. During the RL stage, we design a novel multi-view ranking reward tailored to the multi-turn nature of listwise ranking. Extensive experiments demonstrate that our trained reasoning-intensive reranker \textbf{ReasonRank} significantly outperforms existing baselines and also achieves much lower latency than the pointwise reranker. Our code is available at https://github.com/8421BCD/ReasonRank.
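The "multi-turn nature of listwise ranking" is commonly understood as follows: an LLM reads only a limited window of passages per call, so a long candidate list is reranked in several overlapping turns, sliding the window from the bottom of the list to the top (the strategy popularized by RankGPT-style rerankers). The sketch below illustrates that loop under stated assumptions. It is not the ReasonRank implementation. `rank_window` is a hypothetical stand-in for one LLM call, `toy_ranker` is a hypothetical keyword-overlap scorer used only for illustration, and the default `window_size`/`step` values are assumptions.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for one LLM call: given a query and a window of
# passages, return the window's local indices in predicted relevance order.
WindowRanker = Callable[[str, Sequence[str]], List[int]]


def sliding_window_rerank(
    query: str,
    passages: Sequence[str],
    rank_window: WindowRanker,
    window_size: int = 20,  # assumed value, for illustration only
    step: int = 10,  # assumed value; step < window_size gives overlapping windows
) -> List[int]:
    """Rerank passages back-to-front with overlapping windows.

    Each window is one 'turn' of listwise ranking. Passages promoted to the
    front of a later window fall inside the next (earlier) window, so a
    relevant passage can move toward the top across several turns.
    Returns passage indices ordered from most to least relevant.
    """
    if not 0 < step <= window_size:
        raise ValueError("require 0 < step <= window_size")
    order = list(range(len(passages)))  # current ranking as indices into passages
    end = len(order)
    while True:
        start = max(0, end - window_size)
        window = order[start:end]
        local = rank_window(query, [passages[i] for i in window])
        # Overwrite this slice of the ranking with the window's new order.
        order[start:end] = [window[j] for j in local]
        if start == 0:
            break
        end -= step
    return order


def toy_ranker(query: str, window: Sequence[str]) -> List[int]:
    # Hypothetical toy scorer: count query words present in each passage.
    terms = query.lower().split()
    scores = [sum(t in p.lower().split() for t in terms) for p in window]
    # sorted() is stable, so tied passages keep their current relative order.
    return sorted(range(len(window)), key=lambda j: -scores[j])


passages = [
    "cats sleep",
    "llm reasoning",
    "dogs bark",
    "bananas",
    "reasoning rerankers rank passages",
]
result = sliding_window_rerank(
    "reasoning rerankers", passages, toy_ranker, window_size=3, step=2
)
print(result)  # → [4, 1, 0, 2, 3]
```

In this toy run the most relevant passage starts last and reaches the top only after two turns, which is why a reward computed on a single window's output may not reflect the quality of the final list.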