Rank4Gen: RAG-Preference-Aligned Document Set Selection and Ranking

In the RAG paradigm, document ranking determines the evidence available to downstream generators. Through controlled analysis, we identify two phenomena underexplored by existing rankers: (i) downstream response quality depends not only on relevance but also on the composition and ordering of selected documents, and (ii) such preferences differ systematically across generators. However, existing rankers are trained purely on query--document relevance, leaving both phenomena unmodeled. To close this gap, we construct \textbf{PRISM}, a bilingual preference-aligned dataset built through a four-stage pipeline that compresses the combinatorial subset-and-ordering space by roughly four orders of magnitude and produces response-quality preference supervision conditioned on seven downstream generators. On a 13k-query subset of PRISM, we train \textbf{Rank4Gen}, a generator-aware ranker that performs joint document set selection and ordering. Experiments on five challenging RAG benchmarks show that Rank4Gen improves downstream QA quality on most evaluated generators, with per-generator F1 gains of up to $+2.08$ over the strongest set-selection baseline. Code is available at https://github.com/JOHNNY-fans/Rank4Gen.
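To give a rough sense of why the subset-and-ordering space has to be compressed, the sketch below counts how many ordered document selections exist for a small candidate pool. It is only an illustration of the combinatorics: the function name `ordered_subset_count`, the `max_k` cap, and the pool sizes are assumptions made here, not values or code from the PRISM pipeline.

```python
from math import perm
from typing import Optional


def ordered_subset_count(n: int, max_k: Optional[int] = None) -> int:
    """Count non-empty ordered selections drawn from n candidate documents.

    Each selection picks k distinct documents (1 <= k <= max_k) and fixes
    their order, so there are n! / (n - k)! selections of size k.
    """
    if max_k is None:
        max_k = n  # by default, allow selections of any size up to n
    return sum(perm(n, k) for k in range(1, max_k + 1))


# Hypothetical pool sizes, chosen only to show how fast the space grows.
for n in (3, 5, 10):
    print(n, ordered_subset_count(n))
```

Even a pool of 10 candidates yields 9,864,100 ordered selections, so scoring every one with a downstream generator quickly becomes impractical. That is the kind of growth a staged construction pipeline has to prune.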

