This paper introduces a scalable Anomaly Detection Service with a generalizable API for industrial time-series data, designed to assist Site Reliability Engineers (SREs) in managing cloud infrastructure. The service detects anomalies efficiently in complex data streams, supporting proactive identification and resolution of issues. The paper also presents an approach to anomaly modeling in cloud infrastructure that uses Large Language Models (LLMs) to understand key components, their failure modes, and their behaviors. The service offers a suite of algorithms for detecting anomalies in univariate and multivariate time-series data, including regression-based, mixture-model-based, and semi-supervised approaches. We report on the usage patterns of the service, which recorded over 500 users and 200,000 API calls in a year. The service has been applied successfully in various industrial settings, including IoT-based AI applications, and we have evaluated the system on public anomaly benchmarks to show its effectiveness. With the service, SREs can identify potential issues before they escalate, reducing downtime, improving incident response times, and ultimately enhancing the overall customer experience. We plan to extend the system with time-series foundation models to enable zero-shot anomaly detection.
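To make the idea of a regression-based detector for univariate data concrete, the following is a minimal sketch and not the service's actual algorithm or API. It assumes a simple scheme: fit a least-squares line to a trailing window, forecast the next point, and flag it when the forecast error is large relative to a robust estimate of the window's residual scale. The names `fit_line` and `detect_anomalies`, the window length, and the threshold are all hypothetical choices made for illustration.

```python
from statistics import mean, median


def fit_line(ys):
    """Ordinary least-squares fit of y = intercept + slope * t, t = 0..len(ys)-1."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = mean(ys)
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    slope = sxy / sxx
    return y_mean - slope * t_mean, slope


def detect_anomalies(series, window=10, threshold=4.0):
    """Return indices whose value deviates strongly from a local linear forecast.

    Illustrative only: `window` and `threshold` are assumed defaults,
    not parameters taken from the paper.
    """
    anomalies = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        intercept, slope = fit_line(history)
        residuals = [y - (intercept + slope * t) for t, y in enumerate(history)]
        # Robust scale from the median absolute residual (1.4826 rescales it
        # to a standard deviation under Gaussian noise); floored so that a
        # perfectly linear window does not cause a division by zero.
        scale = max(1.4826 * median(abs(r) for r in residuals), 1e-9)
        forecast = intercept + slope * window  # one step past the window
        if abs(series[i] - forecast) / scale > threshold:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Synthetic trend with small alternating noise and one injected spike.
    data = [0.5 * t + (0.1 if t % 2 else -0.1) for t in range(40)]
    data[30] += 5.0
    print(detect_anomalies(data))
```

In this sketch a flagged point stays in the history of later windows, so points immediately after a spike may also be flagged. A production detector would typically exclude or down-weight flagged points when refitting.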