Search relevance modeling is a core task in e-commerce search systems, assessing how well a user query matches candidate products. Rather than relying on a single holistic matching signal, relevance judgment often requires structured reasoning over query understanding, product understanding, and facet-level matching. With large language models (LLMs), this process is increasingly formulated as chain-of-thought (CoT) reasoning and optimized with reinforcement learning (RL). However, existing RL methods mainly rely on outcome-level rewards and treat the entire reasoning chain as a single optimization unit. This makes it difficult to distinguish faulty reasoning steps from correct intermediate ones, leading to misaligned credit assignment. Although process-reward methods provide denser supervision, they often treat reasoning steps independently and ignore dependency-driven error propagation, making responsibility attribution difficult and limiting the optimization of structured relevance reasoning. We propose Graph-GRPO, a graph-structured extension of GRPO for multi-component relevance reasoning. Graph-GRPO constructs a relevance reasoning dependency graph, where CoT steps are modeled as nodes and their logical dependencies as edges. It propagates outcome-level rewards over the graph to derive step-level credit signals, enabling more accurate fine-grained credit assignment. We further introduce a main-loss-driven controller that adaptively adjusts edge-wise credit-propagation coefficients. Together with CoT random masking for supervised policy initialization and graph-node-based multi-head distillation, we build a trainable and deployable framework for generative relevance modeling. Extensive offline evaluations and online A/B tests on a leading e-commerce platform demonstrate that the Graph-GRPO-based framework improves relevance classification metrics and key engagement metrics.
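To make the idea of propagating an outcome-level reward over a reasoning dependency graph concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify the propagation rule, so this sketch assumes a simple one: the final decision node receives the outcome advantage (in a GRPO setup this would plausibly be the group-normalized reward), and each upstream step receives the sum of its dependents' credit scaled by an edge-wise coefficient. The function name `propagate_credit`, the node names, and the coefficient values are all hypothetical.

```python
from collections import defaultdict


def propagate_credit(edges, outcome_node, outcome_advantage, coef):
    """Backward-propagate an outcome-level advantage to upstream CoT steps.

    edges: list of (src, dst) pairs, meaning step `dst` depends on step `src`.
    coef:  dict mapping (src, dst) -> edge-wise propagation coefficient
           (assumed fixed here; the paper adapts these with a controller).
    The graph is assumed to be acyclic.
    """
    parents = defaultdict(list)
    pending_children = defaultdict(int)
    nodes = {outcome_node}
    for src, dst in edges:
        parents[dst].append(src)
        pending_children[src] += 1
        nodes.update((src, dst))

    credit = {n: 0.0 for n in nodes}
    credit[outcome_node] = outcome_advantage

    # Reverse topological traversal: a node is processed only after
    # every step that depends on it has already passed its credit back.
    ready = [n for n in nodes if pending_children[n] == 0]
    while ready:
        node = ready.pop()
        for p in parents[node]:
            credit[p] += coef[(p, node)] * credit[node]
            pending_children[p] -= 1
            if pending_children[p] == 0:
                ready.append(p)
    return credit


# Hypothetical three-stage relevance chain: two understanding steps
# feed facet-level matching, which feeds the final relevance decision.
edges = [
    ("query_understanding", "facet_matching"),
    ("product_understanding", "facet_matching"),
    ("facet_matching", "decision"),
]
coef = {
    ("query_understanding", "facet_matching"): 0.5,
    ("product_understanding", "facet_matching"): 0.5,
    ("facet_matching", "decision"): 0.8,
}
step_credit = propagate_credit(edges, "decision", 1.0, coef)
```

Under these assumed coefficients, a positive outcome advantage at the decision node is attenuated as it flows back to facet matching and then to the two understanding steps, so steps closer to the outcome receive larger credit. A learned or adaptive choice of coefficients, as the abstract describes, would replace the fixed `coef` dictionary.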