Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is then re-ranked by a late-stage ranker (LSR). While many reinforcement learning (RL) methods exist for training the LSR, end-to-end training of the ESR has proven challenging. In particular, a naive application of "vanilla" policy gradient (V-PG) does not scale to candidate-set sizes of practical relevance because its variance explodes. This issue arises because V-PG propagates the gradient to the joint probability of the entire candidate set, ignoring the contribution of each individual item to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item appears in any candidate set, i.e., the probability obtained by marginalizing over all candidate sets that contain it. Our theoretical analysis shows that, by marginalizing over the specific composition of the candidate set, CA-PG significantly reduces variance relative to V-PG while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability of ESRs that use the canonical Plackett-Luce model, especially when the candidate set is large.
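To make the distinction between the two differentiated quantities concrete, the following is a minimal sketch, not the paper's estimator. It assumes the candidate set is an ordered list drawn without replacement from a Plackett-Luce model with weights exp(score), and it computes the marginal inclusion probability by brute-force enumeration, which is feasible only for toy sizes. The function names, scores, sampled list, and target item are all hypothetical.

```python
import itertools
import math


def pl_ordered_prob(scores, ordering):
    """Plackett-Luce probability of drawing `ordering` (a tuple of item
    indices) sequentially without replacement, with weights exp(score)."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in ordering:
        prob *= weights[item] / remaining
        remaining -= weights[item]
    return prob


def inclusion_prob(scores, target, k):
    """Probability that `target` appears anywhere in a size-k candidate list,
    summed by brute force over every ordered list that contains it."""
    n = len(scores)
    return sum(
        pl_ordered_prob(scores, perm)
        for perm in itertools.permutations(range(n), k)
        if target in perm
    )


# Hypothetical toy setup: 5 items, candidate lists of size 2.
scores = [1.0, 0.5, 0.0, -0.5, -1.0]  # assumed ESR logits
k = 2
sampled = (0, 3)  # assumed sampled candidate list
target = 0        # assumed item that earned the reward

# Quantity a V-PG-style update would differentiate: the log joint
# probability of the whole sampled list.
joint_log_prob = math.log(pl_ordered_prob(scores, sampled))

# Quantity a CA-PG-style update would differentiate: the log probability
# that the target item is included at all, whatever else is in the list.
marginal_log_prob = math.log(inclusion_prob(scores, target, k))

print(joint_log_prob, marginal_log_prob)
```

The joint log-probability depends on every item that happened to be sampled alongside the target, whereas the marginal depends only on whether the target is included. Under the assumptions above, this is the composition dependence that the abstract identifies as the source of V-PG's variance.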