TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance

Query-product relevance prediction is fundamental to e-commerce search and has become even more critical in the era of AI-powered shopping, where semantic understanding and complex reasoning directly shape the user experience and business conversion. Large Language Models (LLMs) enable generative, reasoning-based approaches, typically aligned via supervised fine-tuning (SFT) or preference optimization methods like Direct Preference Optimization (DPO). However, the increasing complexity of business rules and user queries exposes the inability of existing methods to endow models with robust reasoning capacity for long-tail and challenging cases. Efforts to address this via reinforcement learning strategies like Group Relative Policy Optimization (GRPO) often suffer from sparse terminal rewards, offering insufficient guidance for multi-step reasoning and slowing convergence. To address these challenges, we propose TaoSR-AGRL, an Adaptive Guided Reinforcement Learning framework for LLM-based relevance prediction in Taobao Search Relevance. TaoSR-AGRL introduces two key innovations: (1) Rule-aware Reward Shaping, which decomposes the final relevance judgment into dense, structured rewards aligned with domain-specific relevance criteria; and (2) Adaptive Guided Replay, which identifies low-accuracy rollouts during training and injects targeted ground-truth guidance to steer the policy away from stagnant, rule-violating reasoning patterns toward compliant trajectories. TaoSR-AGRL was evaluated on large-scale real-world datasets and through online side-by-side human evaluations on Taobao Search. It consistently outperforms DPO and standard GRPO baselines in offline experiments, improving relevance accuracy, rule adherence, and training stability. The model trained with TaoSR-AGRL has been successfully deployed in the main search scenario on Taobao, serving hundreds of millions of users.

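To make the two components named in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. The abstract does not specify the relevance criteria, the reward weights, the accuracy threshold, or the form of the injected guidance, so `RULE_WEIGHTS`, `FINAL_WEIGHT`, `threshold`, and `build_guided_prompt` are all hypothetical. The advantage computation follows the general group-relative normalization associated with GRPO and is shown only to illustrate how a dense reward might feed into it.

```python
from statistics import mean, pstdev

# Hypothetical rule-level criteria and weights; the abstract does not list them.
RULE_WEIGHTS = {"category": 0.3, "brand": 0.2, "attributes": 0.2}
FINAL_WEIGHT = 0.3  # hypothetical weight on the terminal relevance judgment


def shaped_reward(rule_checks, final_correct):
    """Dense reward: credit for each satisfied rule plus the terminal outcome."""
    dense = sum(w for name, w in RULE_WEIGHTS.items() if rule_checks.get(name, False))
    return dense + (FINAL_WEIGHT if final_correct else 0.0)


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of rollouts for the same input."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def needs_guided_replay(final_correct_flags, threshold=0.25):
    """Flag a sample whose rollout group accuracy is below a hypothetical threshold."""
    accuracy = sum(1 for ok in final_correct_flags if ok) / len(final_correct_flags)
    return accuracy < threshold


def build_guided_prompt(prompt, ground_truth_label):
    """Hypothetical guidance format: append the ground-truth label as a hint."""
    return f"{prompt}\n[Guidance] The correct relevance label is: {ground_truth_label}"
```

Under these assumptions, a training loop would score each rollout with `shaped_reward`, normalize scores within the group, and re-sample any low-accuracy group from the prompt returned by `build_guided_prompt`. When every rollout in a group receives the same reward, all advantages are zero, which is one way to see why a sparse terminal reward gives little learning signal on hard cases.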