Accurate query-item relevance modeling drives e-commerce ranking, yet long-tail, knowledge-heavy, and fast-evolving queries exceed the coverage of an LLM's parametric knowledge. External context (reviews, attribute encyclopedias, user-generated content) can help but is noisy, and the latency and cost budget of single-pass inference rules out any clean-then-summarize step. The model must, for each query, judge relevance and decide whether to use, partially use, or ignore the context. DyKnow-RAG is a dynamic noisy-RAG framework built on Group Relative Policy Optimization (GRPO). It trains two rollout groups per query (no external context vs. a single retrieved chunk) and applies posterior-driven inter-group advantage scaling that adaptively reweights each group's contribution by the per-query correctness gap. This teaches the model when to trust retrieval and when to fall back on parametric knowledge, without process labels, value networks, or extra inference passes, preserving single-pass, single-chunk deployment under production latency. Training combines: (1) supervised initialization with a structured rationale that explicitly records the context-usage decision; (2) an RL pool prioritized by supervised fine-tuning (SFT) uncertainty, focusing training where the context choice is most consequential; and (3) an optional lightweight Direct Preference Optimization (DPO) warm start to stabilize with-context calibration. Under a unified retrieval/index setup and a fixed latency budget, DyKnow-RAG outperforms SFT, DPO, and vanilla GRPO in offline tests, and delivers consistent lifts on GSB, Query Goodrate, and Item Goodrate in Taobao A/B testing. It is deployed in Taobao's production relevance system, serving live traffic. To our knowledge, it is among the first single-pass RAG solutions for e-commerce relevance, turning noisy external signals into reliable gains without added online complexity.
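To make the two-group mechanism concrete, the following is a minimal Python sketch of how group-relative advantages for the no-context and with-context rollouts might be combined under posterior-driven inter-group scaling. The softmax weighting, temperature, and binary correctness rewards are illustrative assumptions, not the paper's exact reweighting rule.

```python
import numpy as np

def group_relative_advantages(rewards):
    """GRPO-style normalization: advantage of each rollout relative to its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def dual_group_advantages(rewards_no_ctx, rewards_with_ctx, temperature=1.0):
    """Hypothetical posterior-driven inter-group advantage scaling.

    For one query we roll out two groups (no external context vs. a single
    retrieved chunk), compute group-relative advantages separately, then
    reweight each group by a softmax over its mean correctness so the group
    that answers this query better contributes more to the policy gradient.
    The softmax form and the temperature are assumptions.
    """
    adv_no_ctx = group_relative_advantages(rewards_no_ctx)
    adv_with_ctx = group_relative_advantages(rewards_with_ctx)

    # Per-query correctness gap, measured by each group's mean reward.
    acc = np.array([np.mean(rewards_no_ctx), np.mean(rewards_with_ctx)])
    w = np.exp(acc / temperature)
    w = w / w.sum()

    return adv_no_ctx * w[0], adv_with_ctx * w[1]

# Example: retrieval helps on this query, so the with-context group is upweighted.
print(dual_group_advantages([0, 1, 0, 0], [1, 1, 1, 0]))
```

When the two groups are equally correct, the weights are equal and the update reduces to ordinary per-group GRPO; the scaling only shifts credit when one knowledge source is clearly better for that query.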
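Step (2) of the training recipe prioritizes the RL pool by SFT uncertainty. Below is a minimal sketch of one plausible selection rule, assuming a binary-entropy uncertainty score over the SFT model's estimated probability of producing a correct relevance label; the scoring function and top-k cutoff are assumptions for illustration.

```python
import math

def binary_entropy(p, eps=1e-6):
    """Entropy of a Bernoulli(p); highest when the SFT model is least certain."""
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

def build_rl_pool(samples, sft_p_correct, k):
    """Keep the k samples where the SFT model is least certain, i.e. where the
    decision to use, partially use, or ignore the context matters most.

    samples       : list of sample ids
    sft_p_correct : dict mapping sample id -> SFT-estimated probability of a
                    correct relevance label (an assumed proxy for uncertainty)
    k             : size of the RL pool
    """
    ranked = sorted(samples, key=lambda s: binary_entropy(sft_p_correct[s]), reverse=True)
    return ranked[:k]

# Example: q2 is the most uncertain sample, so it enters the pool first.
pool = build_rl_pool(["q1", "q2", "q3"], {"q1": 0.95, "q2": 0.55, "q3": 0.10}, k=2)
print(pool)  # ['q2', 'q3']
```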