Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can cause massive financial losses and erode user trust. Customer incidents are a vital signal for discovering risks that monitoring misses, but extracting actionable intelligence from them remains challenging because of extreme noise, high throughput, and the semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that combines efficient indexing techniques with Large Language Models (LLMs) to decide when events should be merged, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling peak loads of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and signal-to-noise ratio.
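The abstract describes the event linking engine only at a high level, so the following is a minimal sketch of the general idea rather than the TingIS implementation: a cheap index narrows each incoming message to a few candidate events, and a more expensive decision step is consulted only for those candidates. Everything concrete here is an assumption made for illustration: the inverted index with Jaccard scoring, the thresholds, the names `EventLinker` and `keyword_decider`, and above all the keyword rule standing in for the LLM merge decision.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer (an assumption; a real system would do more)."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class Event:
    event_id: int
    messages: List[str] = field(default_factory=list)
    tokens: Set[str] = field(default_factory=set)


class EventLinker:
    """Hypothetical two-stage linker: index lookup first, merge decision second."""

    def __init__(
        self,
        merge_decider: Callable[[str, List[str]], bool],
        candidate_threshold: float = 0.2,  # illustrative value, not from the paper
        max_candidates: int = 3,           # illustrative value, not from the paper
    ) -> None:
        self.merge_decider = merge_decider
        self.candidate_threshold = candidate_threshold
        self.max_candidates = max_candidates
        self.events: List[Event] = []
        self.index: Dict[str, Set[int]] = {}  # token -> ids of events containing it

    def _candidates(self, tokens: Set[str]) -> List[Event]:
        """Stage 1: use the inverted index to shortlist similar events."""
        ids: Set[int] = set()
        for token in tokens:
            ids |= self.index.get(token, set())
        scored = [(jaccard(tokens, self.events[i].tokens), i) for i in ids]
        scored = [s for s in scored if s[0] >= self.candidate_threshold]
        scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, stable on id
        return [self.events[i] for _, i in scored[: self.max_candidates]]

    def add(self, message: str) -> int:
        """Link a message to an existing event or open a new one; return the event id."""
        tokens = tokenize(message)
        for event in self._candidates(tokens):
            # Stage 2: the expensive decision runs only on shortlisted events.
            if self.merge_decider(message, event.messages):
                target = event
                break
        else:
            target = Event(event_id=len(self.events))
            self.events.append(target)
        target.messages.append(message)
        target.tokens |= tokens
        for token in tokens:
            self.index.setdefault(token, set()).add(target.event_id)
        return target.event_id


def keyword_decider(message: str, existing: List[str]) -> bool:
    """Placeholder for an LLM call: merge if two or more tokens are shared."""
    new_tokens = tokenize(message)
    return any(len(new_tokens & tokenize(m)) >= 2 for m in existing)


linker = EventLinker(merge_decider=keyword_decider)
first = linker.add("payment page fails to load")
second = linker.add("cannot load payment page")
third = linker.add("login code never arrives")
```

In this toy run the two payment complaints land in the same event and the login complaint opens a new one. Replacing `keyword_decider` with a function that prompts a language model would keep the same control flow while limiting model calls to the shortlisted candidates, which is the cost argument behind pairing an index with an LLM.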