Many modern retrieval problems are set-valued: given a broad intent, the system must return a collection of results that optimizes higher-order properties (e.g., diversity, coverage, complementarity, coherence) while remaining grounded with respect to a fixed database. Set-valued objectives are typically non-decomposable and are not captured by existing supervised (query, content) datasets, which prioritize only top-1 retrieval. Consequently, fan-out retrieval is often employed, generating diverse subqueries to retrieve item sets. While reinforcement learning (RL) can optimize set-level objectives via interaction, deploying an RL-tuned LLM for fan-out retrieval is prohibitively expensive at inference time. Conversely, diffusion-based generative retrieval enables efficient single-pass fan-out in embedding space, but requires objective-aligned training targets. To address these issues, we propose R4T (Retrieve-for-Train), which uses RL once as an objective transducer in a three-step process: (i) train a fan-out LLM with composite set-level rewards, (ii) synthesize objective-consistent training pairs, and (iii) train a lightweight diffusion retriever to model the conditional distribution of set-valued outputs. Across large-scale fashion and music benchmarks consisting of curated item sets, we show that R4T improves retrieval quality relative to strong baselines while reducing query-time fan-out latency by an order of magnitude.