Building a search relevance model that achieves both low latency and high performance is a long-standing challenge in the search industry. To satisfy the millisecond-level response requirements of online systems while retaining the interpretable reasoning traces of Large Language Models (LLMs), we propose a novel \textbf{Answer-First, Reason Later (AFRL)} paradigm, which requires the model to emit the definitive relevance score as the very first token, followed by a structured logical explanation. Inspired by the success of reasoning models, we adopt a "Supervised Fine-Tuning (SFT) + Reinforcement Learning (RL)" pipeline to realize AFRL. However, directly applying existing RL training to the search relevance task often leads to \textbf{mode collapse}, where the model forgets complex long-tail rules in pursuit of high rewards. From an information-theoretic perspective, RL inherently minimizes the \textbf{reverse KL divergence}, which seeks probability peaks (mode-seeking) and is prone to reward hacking, whereas SFT minimizes the \textbf{forward KL divergence}, forcing the model to cover the data distribution (mode-covering) and thereby anchoring expert rules. Based on this insight, we propose a \textbf{Mode-Balanced Optimization} strategy that incorporates an SFT auxiliary loss into Stepwise-GRPO training to balance these two properties. Furthermore, we construct an automated instruction-evolution system and a multi-stage curriculum to ensure expert-level data quality. Extensive experiments demonstrate that our 32B teacher model achieves state-of-the-art performance. Moreover, the AFRL architecture enables efficient knowledge distillation, successfully transferring expert-level logic to a 0.6B model and thereby reconciling reasoning depth with deployment latency.
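The mode-seeking versus mode-covering contrast can be made precise. As a sketch (notation ours, not taken from the paper), let $p$ denote the expert data distribution and $q_\theta$ the model distribution:

```latex
% Forward KL (the SFT objective): mode-covering.
% The loss blows up wherever p(x) > 0 but q_theta(x) is near 0,
% so the model must keep mass on every expert rule, including long-tail ones.
D_{\mathrm{KL}}(p \,\|\, q_\theta)
  = \mathbb{E}_{x \sim p}\!\left[\log \frac{p(x)}{q_\theta(x)}\right]

% Reverse KL (implicitly minimized by RL): mode-seeking.
% Expectation is under q_theta, so collapsing onto a single high-reward
% peak of p incurs no penalty -- the mode collapse described above.
D_{\mathrm{KL}}(q_\theta \,\|\, p)
  = \mathbb{E}_{x \sim q_\theta}\!\left[\log \frac{q_\theta(x)}{p(x)}\right]

% One way to combine the two behaviors, with lambda an assumed
% mixing weight (the paper's exact formulation may differ):
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{GRPO}} \;+\; \lambda\, \mathcal{L}_{\mathrm{SFT}}
```

Under this reading, the SFT auxiliary term supplies the mode-covering pressure that plain reward maximization lacks.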