CoRe: A Continuously Reward-Finetuned LLM Query Rewriter for Multi-Stage Context-Aware Relevance in Web-Scale Video Search

Yilin Wen, Rong Yang, Xiaojia Chang, Hong Sun, Gefu Tang, Chunhui Liu, Jeffrey Chen, Zeyu Ma, Lisong Qiu, Xiaochuan Fan, Congjia Yu, Quan Zhou, Yuheng Chen, Zian Wang

from arXiv, 12 pages, 3 figures

LLM-based query rewriters in production face a tension: the training reward must reflect how the rewrite is consumed by the production ranker, yet the training procedure must be cheap enough to support continuous redeployment as data drifts. We present CoRe (Context Relevance), a system that addresses this tension and has been redeployed weekly for over five months in a major short-video search engine. Our reward uses the deployed multimodal relevance model as its source and a multiplicative ratio form mirroring the production fusion algebra, closing the simulation-production gap that offline reward proxies leave open. A semi-online Mixed Preference Optimization loop makes this reward affordable at multi-million-instance weekly scale: a DPO-style pairwise objective restricts the gradient pass to a small top-k/bottom-k subset of sampled trajectories, and a phase structure reduces trainer/inference-server parameter syncs from per-step to per-phase. An automated promotion gate over reward-like and stability metrics detected and recovered from a real reward-hacking incident in production. Rewriter output is consumed as parallel relevance signals at recall, rawrank, and finerank without displacing the original signals, which bounds the blast radius of rewriter failures. Online A/B tests from two sequential production launches (first deploying the rewriter at finerank, then extending consumption to recall and rawrank) deliver statistically significant reductions in change-query rate on rewrite-impacted queries, with all headline relevance and engagement metrics moving in the expected direction.
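To make the top-k/bottom-k idea concrete, here is a minimal sketch of how sampled rewrites for one query could be reduced to a few preference pairs and scored with the standard DPO loss. It is an illustration under assumptions, not the paper's implementation: the function names `select_pairs` and `dpo_loss`, the best-with-worst pairing scheme, and the value `beta=0.1` are all hypothetical, and the abstract does not specify how the Mixed Preference Optimization objective combines this pairwise term with anything else.

```python
import math
from typing import List, Tuple


def select_pairs(rewards: List[float], k: int) -> List[Tuple[int, int]]:
    """Return (chosen_idx, rejected_idx) pairs built from the k highest-
    and k lowest-reward samples. Pairing best-with-worst is an assumption."""
    if k < 1 or 2 * k > len(rewards):
        raise ValueError("need k >= 1 and at least 2*k samples")
    # Indices sorted by reward, highest first.
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=True)
    top, bottom = order[:k], order[-k:]
    # Pair the best sample with the worst, the second best with the second worst, ...
    return list(zip(top, reversed(bottom)))


def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair, given sequence log-probabilities
    under the policy (pol_*) and the frozen reference model (ref_*)."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Six sampled rewrites for one query, with hypothetical rewards.
rewards = [0.2, 0.9, 0.5, 0.1, 0.7, 0.3]
pairs = select_pairs(rewards, k=2)
print(pairs)  # [(1, 3), (4, 0)]
```

Only the samples that appear in `pairs` would need a forward and backward pass through the trainer; the remaining samples are scored by the reward but discarded, which is where the cost saving described above would come from.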
