Reinforcement learning (RL) is an effective way to optimize Large Language Model (LLM)-based recommenders by contrasting positive and negative items. Empirically, training with beam-search negatives consistently outperforms training with random negatives, yet the mechanism is not well understood. We address this gap by analyzing the induced optimization objective and show that (i) under binary reward feedback, optimizing LLM recommenders with Group Relative Policy Optimization (GRPO) is theoretically equivalent to maximizing the Area Under the ROC Curve (AUC), which is often misaligned with Top-$K$ recommendation; and (ii) replacing random negatives with beam-search negatives reshapes the objective toward partial AUC, improving alignment with Top-$K$ metrics. Motivated by this perspective, we introduce Windowed Partial AUC (WPAUC), which constrains the false positive rate (FPR) to a window $[\alpha, \alpha+d]$ to align more directly with Top-$K$ metrics. We further propose an efficient Threshold-Adjusted Windowed reweighting (TAWin) RL method for optimizing WPAUC, enabling explicit control over the targeted Top-$K$ performance. Experiments on four real-world datasets validate the theory and deliver consistent state-of-the-art performance.
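To make the contrast between full AUC and an FPR-windowed partial AUC concrete, the following is a minimal sketch for a single user's item scores. It is not the paper's WPAUC estimator or the TAWin method: the function names `auc` and `windowed_partial_auc`, the rounding of window boundaries to whole negatives, and the normalization by window size are all assumptions made for illustration. The idea it illustrates is that restricting the FPR to $[\alpha, \alpha+d]$ amounts to comparing positives only against the negatives whose rank (by descending score) falls in that fraction of the negative list.

```python
import numpy as np


def auc(pos_scores, neg_scores):
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    return float((pos[:, None] > neg[None, :]).mean())


def windowed_partial_auc(pos_scores, neg_scores, alpha, d):
    """Illustrative windowed partial AUC over the FPR range [alpha, alpha + d].

    Assumption (not taken from the paper): the window is realized by keeping
    the negatives ranked between alpha*n and (alpha+d)*n in descending score
    order, and the result is normalized by the number of retained pairs.
    """
    if not (0.0 <= alpha and d > 0.0 and alpha + d <= 1.0):
        raise ValueError("require 0 <= alpha, d > 0, and alpha + d <= 1")
    pos = np.asarray(pos_scores, dtype=float)
    neg_sorted = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    n = len(neg_sorted)
    lo = min(int(round(alpha * n)), n - 1)          # first negative in the window
    hi = min(max(int(round((alpha + d) * n)), lo + 1), n)  # keep at least one
    window = neg_sorted[lo:hi]
    return float((pos[:, None] > window[None, :]).mean())


# Toy scores: one well-ranked positive and one buried below two hard negatives.
pos = [0.9, 0.4]
neg = [0.8, 0.5, 0.3, 0.1]

print(auc(pos, neg))                             # 0.75
print(windowed_partial_auc(pos, neg, 0.0, 0.5))  # 0.5 (hardest negatives only)
print(windowed_partial_auc(pos, neg, 0.5, 0.5))  # 1.0 (easy negatives only)
```

In this toy case the full AUC looks reasonable (0.75) even though one positive sits below both top-scoring negatives, whereas the window at low FPR exposes the failure (0.5). This mirrors, in a simplified way, why an objective focused on high-scoring negatives can track Top-$K$ quality more closely than full AUC.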