Rerankers play a pivotal role in refining retrieval results for Retrieval-Augmented Generation (RAG). However, current reranking models are typically optimized in isolation on static, human-annotated relevance labels, decoupled from the downstream generation process. This isolation leads to a fundamental misalignment: documents judged topically relevant by information retrieval metrics often fail to provide the utility the LLM actually needs for precise answer generation. To bridge this gap, we introduce ReRanking Preference Optimization (RRPO), a reinforcement learning framework that directly aligns reranking with the LLM's generation quality. By formulating reranking as a sequential decision-making process, RRPO optimizes for context utility using LLM feedback, thereby eliminating the need for expensive human annotations. To ensure training stability, we further introduce a reference-anchored deterministic baseline. Extensive experiments on knowledge-intensive benchmarks demonstrate that RRPO significantly outperforms strong baselines, including the powerful list-wise reranker RankZephyr. Further analysis highlights the versatility of our framework: it generalizes to diverse readers (e.g., GPT-4o), composes with query expansion modules such as Query2Doc, and remains robust even when trained with noisy supervisors.
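The abstract does not spell out the training objective, so the following is only a minimal sketch of how reranking as sequential decision-making with a reference-anchored baseline might look, not the RRPO implementation. It assumes a Plackett-Luce style policy that picks documents one at a time, a REINFORCE-style advantage, and a baseline equal to the utility of the deterministic top-k ranking of a fixed reference scorer. The names `sample_ranking`, `greedy_ranking`, `anchored_advantage`, and `toy_utility` are hypothetical, and `toy_utility` is a stand-in for real LLM feedback.

```python
import math
import random


def sample_ranking(scores, k, rng):
    """Sequentially sample k documents without replacement.

    Each step is a softmax over the documents still available, so the
    ranking is built one decision at a time. Returns the chosen indices
    and the log-probability of the whole sequence.
    """
    remaining = list(range(len(scores)))
    chosen, log_prob = [], 0.0
    for _ in range(k):
        top = max(scores[i] for i in remaining)  # subtract max for stability
        weights = [math.exp(scores[i] - top) for i in remaining]
        total = sum(weights)
        pick = rng.choices(range(len(remaining)), weights=weights)[0]
        log_prob += math.log(weights[pick] / total)
        chosen.append(remaining.pop(pick))
    return chosen, log_prob


def greedy_ranking(scores, k):
    """Deterministic top-k ranking by descending score."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]


def anchored_advantage(policy_scores, reference_scores, utility, k, rng):
    """Advantage of a sampled ranking over the reference ranking.

    Assumption: the baseline is the utility of the reference scorer's
    greedy ranking, which is deterministic and therefore adds no
    sampling noise of its own.
    """
    sampled, log_prob = sample_ranking(policy_scores, k, rng)
    baseline = utility(greedy_ranking(reference_scores, k))
    advantage = utility(sampled) - baseline
    return sampled, log_prob, advantage


def toy_utility(ranking):
    """Hypothetical reader feedback: 1.0 if document 2 is in the context."""
    return 1.0 if 2 in ranking else 0.0


if __name__ == "__main__":
    rng = random.Random(0)
    policy_scores = [0.1, 0.2, 3.0, 0.0]     # current reranker scores
    reference_scores = [1.0, 0.5, 0.0, 0.2]  # frozen reference scores
    ranking, log_prob, adv = anchored_advantage(
        policy_scores, reference_scores, toy_utility, k=2, rng=rng
    )
    # A REINFORCE-style loss for this sample would be -adv * log_prob.
    print(ranking, log_prob, adv)
```

In this sketch a ranking is reinforced only when it yields more reader utility than the reference ranking does, which is one plausible reading of "reference-anchored"; the paper's actual baseline may differ.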