Retrieval-GRPO: A Multi-Objective Reinforcement Learning Framework for Dense Retrieval in Taobao Search

Dense retrieval is a core component of e-commerce search engines: it maps user queries and items into a shared semantic space with pre-trained embedding models, enabling large-scale, real-time semantic retrieval. Although LLMs are gradually replacing traditional BERT architectures as embedding backbones, their training paradigms still follow BERT-style supervised fine-tuning and hard negative mining. This approach relies on complex offline pipelines for constructing hard negative samples, which slows model iteration and limits how far semantic representation capability can improve. In addition, existing multi-task learning frameworks suffer from the seesaw effect when they optimize semantic relevance and non-relevance objectives simultaneously. In this paper, we propose Retrieval-GRPO, a dense retrieval framework based on multi-objective reinforcement learning that is designed to address these challenges. The method eliminates offline hard negative construction by dynamically retrieving the Top-K candidate products for each query during training, and introduces a relevance LLM as a reward model to provide real-time feedback. Specifically, the retrieval model optimizes its embedding representations through reinforcement learning, with reward signals that combine LLM-generated relevance scores, product quality scores, and multi-way exclusivity metrics to achieve multi-objective alignment with user preferences and real-time error correction. This mechanism not only removes the dependency on hard negatives but also mitigates the seesaw effect through collaborative multi-objective optimization, significantly improving the model's semantic generalization to complex long-tail queries. Extensive offline and online experiments validate the effectiveness of Retrieval-GRPO, which has been deployed on China's largest e-commerce platform.
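
The training loop described above (retrieve Top-K candidates online, score them with several reward signals, then update the retriever) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not give the objective, the reward weights, or the policy parameterization. The function names, the weights `(1.0, 0.3, 0.3)`, the softmax-over-candidates policy, and the random stand-ins for the relevance LLM, quality score, and exclusivity signal are all assumptions made for illustration. The only part taken from standard GRPO is the group-relative advantage, which standardizes rewards within the group of candidates sampled for one query.

```python
import numpy as np

def top_k_candidates(query_emb, item_embs, k):
    """Return indices and scores of the k items with the highest dot product."""
    scores = item_embs @ query_emb
    top = np.argsort(-scores)[:k]
    return top, scores[top]

def combined_reward(relevance, quality, exclusivity, weights=(1.0, 0.3, 0.3)):
    """Weighted sum of the three reward signals (weights are hypothetical)."""
    w_rel, w_q, w_ex = weights
    return w_rel * relevance + w_q * quality + w_ex * exclusivity

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize rewards within one query's group."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def policy_gradient_loss(scores, advantages, temperature=1.0):
    """Negative advantage-weighted log-probability under a softmax over the
    retrieved group (an assumed policy form, not taken from the paper)."""
    logits = scores / temperature
    logits = logits - logits.max()  # subtract max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -(advantages * log_probs).mean()

# Toy data: random embeddings and random stand-ins for the reward signals.
rng = np.random.default_rng(0)
query = rng.normal(size=8)
items = rng.normal(size=(100, 8))

idx, scores = top_k_candidates(query, items, k=5)
relevance = rng.uniform(size=5)        # stand-in for the relevance LLM score
quality = rng.uniform(size=5)          # stand-in for a product quality score
exclusivity = rng.integers(0, 2, 5)    # stand-in: 1 if no other channel returns the item

rewards = combined_reward(relevance, quality, exclusivity)
advantages = group_relative_advantages(rewards)
loss = policy_gradient_loss(scores, advantages)
print(idx, advantages.round(3), float(loss))
```

In a real system the loss would be differentiated with respect to the embedding model's parameters in an autodiff framework; the sketch only shows how the candidate group, the combined reward, and the group-relative advantage fit together. Because the candidates come from the current model's own Top-K, high-scoring but low-reward items receive negative advantages, which is the role that offline hard negatives play in supervised fine-tuning.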
