Reinforcement learning (RL) offers a promising way to improve generative recommendation beyond supervised imitation by using reward signals to guide policy improvement. However, its effectiveness depends on the reward model being trustworthy on the samples it evaluates. In practice, production rankers, the most widely adopted reward models, are trained on exposure-biased logs, so their errors are sample-dependent and this assumption is violated. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy is uncertain and the ranker can discriminate the ground-truth item from rollout negatives. On the remaining samples, the reward signal is either negligible or detrimental, which highlights the risk of applying RL uniformly. To address this issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood (NLL), while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances that fail either diagnostic default to pure supervision, which ensures stability and mitigates the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it raises HR@10 from 11.01% to 12.18% while keeping hallucination below 0.22%, and it remains robust at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL--GRPO mixtures across the retrieval--validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.
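The selective-admission idea can be illustrated with a minimal sketch. The abstract does not give the exact form of the two diagnostics, so everything concrete here is an assumption made for illustration: the `RolloutGroup` fields, the hit-rate proxy for policy-side difficulty, the win-rate proxy for reward discriminability, both thresholds, and the `weight` parameter. Only the overall shape (an NLL anchor plus a GRPO term switched on by a binary per-sample gate) follows the description above.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import List


@dataclass
class RolloutGroup:
    """Rollouts sampled for one training instance (hypothetical layout)."""
    hits: List[bool]       # True where a rollout equals the ground-truth item
    rewards: List[float]   # ranker score assigned to each rollout
    target_reward: float   # ranker score assigned to the ground-truth item


def policy_is_uncertain(group: RolloutGroup, max_hit_rate: float = 0.5) -> bool:
    """Assumed difficulty proxy: the policy rarely samples the ground truth."""
    hit_rate = sum(group.hits) / len(group.hits)
    return hit_rate <= max_hit_rate


def reward_is_discriminative(group: RolloutGroup, min_win_rate: float = 0.8) -> bool:
    """Assumed discriminability proxy: the ranker scores the ground truth
    above most rollout negatives."""
    negatives = [r for r, hit in zip(group.rewards, group.hits) if not hit]
    if not negatives:
        return False  # no negatives, so nothing to discriminate against
    wins = sum(group.target_reward > r for r in negatives)
    return wins / len(negatives) >= min_win_rate


def admission_gate(group: RolloutGroup) -> int:
    """Binary per-sample gate: 1 only when both diagnostics pass."""
    return int(policy_is_uncertain(group) and reward_is_discriminative(group))


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-normalized advantages in the style commonly used with GRPO."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def gated_loss(nll: float, grpo: float, group: RolloutGroup, weight: float = 1.0) -> float:
    """NLL is always applied; the GRPO term is admitted only through the gate."""
    return nll + admission_gate(group) * weight * grpo


# Usage: an uncertain policy with a discriminative ranker admits the GRPO term.
admitted = RolloutGroup(hits=[False, False, True, False],
                        rewards=[0.1, 0.2, 0.9, 0.3], target_reward=0.9)
# A confident policy falls back to pure supervision.
rejected = RolloutGroup(hits=[True, True, True, True],
                        rewards=[0.9, 0.9, 0.9, 0.9], target_reward=0.9)
```

In this sketch a rejected sample contributes exactly its NLL loss, which mirrors the stated fallback to pure supervision when either diagnostic fails.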