Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
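The distinction between the two gradient targets can be made concrete on a toy problem. The sketch below is an illustration only, not the paper's estimator: it assumes a Plackett-Luce ESR that draws k items without replacement with weights exp(score), and it computes by brute-force enumeration (a) the joint probability of one ordered candidate set, the quantity whose log V-PG differentiates, and (b) the marginal probability that a given item appears in any candidate set, the quantity CA-PG is described as targeting. The function names `pl_prob` and `inclusion_prob`, the scores, and the set size are hypothetical choices made for this example, and enumeration is only feasible for very small item pools.

```python
import itertools
import math


def pl_prob(scores, ordered_set):
    """Plackett-Luce probability of drawing `ordered_set` in that exact order,
    sampling without replacement with weights exp(score)."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in ordered_set:
        prob *= weights[item] / remaining
        remaining -= weights[item]  # the drawn item leaves the pool
    return prob


def inclusion_prob(scores, item, k):
    """Probability that `item` appears anywhere in a size-k candidate set,
    obtained by summing over every ordered set that contains it (brute force)."""
    n = len(scores)
    return sum(
        pl_prob(scores, perm)
        for perm in itertools.permutations(range(n), k)
        if item in perm
    )


# Hypothetical toy setup: 5 items, candidate sets of size 3.
scores = [2.0, 1.0, 0.5, 0.0, -1.0]
k = 3
candidate = (0, 2, 3)  # one sampled candidate set, in draw order

joint = pl_prob(scores, candidate)        # depends on the whole set
marginal = inclusion_prob(scores, 2, k)   # depends only on item 2 being included
print(f"joint={joint:.4f}  marginal(item 2)={marginal:.4f}")
```

The joint probability changes whenever any other member of the candidate set changes, while the marginal for item 2 does not, which is the intuition behind the variance reduction the abstract describes. How CA-PG computes or approximates this marginal at realistic candidate-set sizes is not stated in the abstract and is not shown here.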

