Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel \textbf{Answer-First, Reason Later (AFRL)} paradigm, which requires the model to emit the definitive relevance score as the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to realize AFRL. However, directly applying existing RL training to the search relevance task often leads to \textbf{mode collapse}, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the \textbf{reverse KL divergence}, which seeks probability peaks (mode-seeking) and is prone to reward hacking, whereas SFT minimizes the \textbf{forward KL divergence}, forcing the model to cover the data distribution (mode-covering) and thereby anchoring expert rules. Based on this insight, we propose a \textbf{Mode-Balanced Optimization} strategy that incorporates an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction-evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model and thereby reconciling reasoning depth with deployment latency.
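The mode-seeking versus mode-covering contrast can be made precise. As a sketch (notation ours, not taken from the paper), let $p$ denote the expert data distribution and $q_\theta$ the model distribution:

```latex
% Forward KL (the SFT objective): mode-covering.
% The loss blows up wherever p(x) > 0 but q_theta(x) is near 0,
% so the model must keep mass on every expert rule, including long-tail ones.
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]

% Reverse KL (implicitly minimized by RL): mode-seeking.
% Expectation is under q_theta, so collapsing onto a single high-reward
% peak of p incurs no penalty -- the mode collapse described above.
D_{\mathrm{KL}}(q_\theta \,\|\, p)
  = \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right]

% One way to combine the two behaviors, with lambda an assumed
% mixing weight (the paper's exact formulation may differ):
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{GRPO}} \;+\; \lambda\, \mathcal{L}_{\mathrm{SFT}}
```

Under this reading, the SFT auxiliary term supplies the mode-covering pressure that plain reward maximization lacks.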