People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. Addressing this problem, in this work we conduct an empirical characterization of outages and failure recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of the statistical properties of failure recovery, its temporal patterns, failure co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
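The kind of analysis described above — failure-recovery durations and temporal patterns derived from provider incident records — can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: the incident tuples, service names, and timestamps below are invented, and the real study uses the full datasets linked above.

```python
from datetime import datetime
from collections import Counter

# Hypothetical incident records (service, start, end), as one might scrape
# from a provider status page. All values here are made up for illustration.
incidents = [
    ("ChatGPT", "2024-03-04 09:10", "2024-03-04 11:40"),
    ("ChatGPT", "2024-03-11 14:00", "2024-03-11 14:45"),
    ("Claude",  "2024-03-05 08:30", "2024-03-05 09:00"),
    ("Claude",  "2024-03-12 16:20", "2024-03-12 16:50"),
    ("Claude",  "2024-03-19 10:00", "2024-03-19 10:25"),
]

FMT = "%Y-%m-%d %H:%M"

def mttr_minutes(service):
    """Mean time to recovery (minutes) over all incidents of one service."""
    durations = [
        (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds() / 60
        for name, start, end in incidents
        if name == service
    ]
    return sum(durations) / len(durations)

# A simple temporal pattern: incidents per weekday (0 = Monday). Real weekly
# periodicity analysis would use far more data and e.g. spectral methods.
weekday_counts = Counter(
    datetime.strptime(start, FMT).weekday() for _, start, _ in incidents
)

print(f"ChatGPT MTTR: {mttr_minutes('ChatGPT'):.1f} min")  # fewer but longer failures
print(f"Claude  MTTR: {mttr_minutes('Claude'):.1f} min")   # more frequent, shorter failures
print("Incidents per weekday:", dict(weekday_counts))
```

On this toy data the sketch reproduces the shape of observation (1): the ChatGPT incidents are rarer but take longer to resolve than the Claude ones.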