Operations and maintenance (O&M) of large-scale online engine systems (search, recommendation, advertising) demands substantial human effort for release monitoring, alert response, and root cause analysis. While LLM-based agents are a natural fit for these tasks, the deployment bottleneck is not reasoning capability but orchestration: selecting, for each operational event, the relevant data (metrics, logs, change events) and the applicable operational knowledge (handbook rules and practitioner experience). Feeding all signals to the agent indiscriminately dilutes the relevant evidence and invites hallucination, while manually curating the event-to-(data, knowledge) mapping is intractable when there are dozens of releases per day. We present Bian Que, an agentic framework with three contributions: (i) a \emph{unified operational paradigm} that abstracts day-to-day O&M into three canonical patterns: release interception, proactive inspection, and alert root cause analysis; (ii) \emph{Flexible Skill Arrangement}, where each Skill specifies which data and knowledge to retrieve for a given business-module context and can be generated and updated automatically by LLMs or refined iteratively through natural-language instructions from on-call engineers; (iii) a \emph{unified self-evolving mechanism} in which a single correction signal drives two parallel pathways: case-memory-to-knowledge distillation and targeted Skill refinement. Deployed on the e-commerce search engine of KuaiShou, a major short-video platform in China, Bian Que reduces alert volume by 75%, achieves 80% root cause analysis accuracy, and cuts mean time to resolution by over 50%. In offline evaluations, the framework achieves a 99.0% pass rate. Our code is available at https://github.com/benchen4395/BianQue_Assistant.