Retrieval-Augmented Generation (RAG) improves generation quality by incorporating evidence retrieved from large external corpora. However, most existing methods statically select the top-k passages by individual relevance, which fails to exploit combinatorial gains among passages and often introduces substantial redundancy. To address this limitation, we propose OptiSet, a set-centric framework that unifies set selection and set-level ranking for RAG. OptiSet adopts an "Expand-then-Refine" paradigm: it first expands a query into multiple perspectives to build a diverse candidate pool, then refines that pool via re-selection into a compact evidence set. We further devise a self-synthesis strategy that requires no strong-LLM supervision, deriving preference labels from the generator's set-conditional utility changes to distinguish complementary evidence from redundant evidence. Finally, we introduce a set-listwise training strategy that jointly optimizes set selection and set-level ranking, encouraging the model to favor compact, high-gain evidence sets. Extensive experiments demonstrate that OptiSet improves performance on complex combinatorial questions while making generation more efficient. The source code is publicly available.
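To make the self-synthesis idea concrete, the sketch below shows one way set-conditional utility changes could yield preference labels: score each candidate passage by the generator-utility gain it adds to the current evidence set, labeling positive-gain passages complementary and zero-gain passages redundant. The function names (`generator_utility`, `label_candidates`) and the token-overlap utility are illustrative stand-ins, not the paper's actual implementation; in practice the utility would be something like the generator's log-likelihood of the reference answer given the evidence set.

```python
# Hedged sketch of preference-label synthesis from set-conditional utility
# changes. All names and the toy utility function are hypothetical.

def generator_utility(query, evidence, answer):
    """Placeholder utility: count distinct answer tokens covered by the
    evidence set. A real system would score the answer's likelihood
    under the generator conditioned on (query, evidence)."""
    covered = set()
    answer_tokens = set(answer.split())
    for passage in evidence:
        covered |= set(passage.split()) & answer_tokens
    return len(covered)

def label_candidates(query, current_set, candidates, answer):
    """Label each candidate +1 (complementary) or -1 (redundant) by its
    marginal utility when added to the current evidence set."""
    base = generator_utility(query, current_set, answer)
    labels = {}
    for doc in candidates:
        gain = generator_utility(query, current_set + [doc], answer) - base
        labels[doc] = 1 if gain > 0 else -1
    return labels

labels = label_candidates(
    "who wrote hamlet",
    ["hamlet is a famous tragedy"],
    ["william shakespeare was an english playwright",
     "hamlet is a well known play"],
    "william shakespeare",
)
# The first candidate adds new answer-relevant content (complementary);
# the second adds nothing beyond the current set (redundant).
```

Pairs labeled this way can then supervise the set-listwise objective without querying a stronger LLM.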