People and businesses increasingly rely on public LLM services such as ChatGPT, DALLE, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only a few studies exist in this emerging area. To address this problem, we conduct an empirical characterization of outages and failure recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of the statistical properties of failure recovery, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
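To make the two quantities compared in observation (1) concrete, the following minimal sketch computes a mean time to resolve and a mean time between failures from incident timestamps. It is an illustration only and is not taken from the published code: the record layout, the service names, the timestamps, and the function name `recovery_stats` are all hypothetical, and the gap between failures is measured from the resolution of one incident to the start of the next, which is one common convention among several.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: (service, start, resolved).
# These values are made up for illustration; they are not data from the study.
INCIDENTS = [
    ("service-a", "2024-01-01T00:00", "2024-01-01T02:00"),
    ("service-a", "2024-01-03T00:00", "2024-01-03T01:00"),
    ("service-b", "2024-01-02T12:00", "2024-01-02T12:30"),
]


def recovery_stats(incidents, service):
    """Return (mean time to resolve, mean time between failures) in hours.

    The second value is None when the service has fewer than two incidents.
    """
    # Keep only this service's incidents, ordered by start time.
    spans = sorted(
        (datetime.fromisoformat(start), datetime.fromisoformat(end))
        for name, start, end in incidents
        if name == service
    )
    if not spans:
        raise ValueError(f"no incidents recorded for {service!r}")

    # Time to resolve: duration of each incident.
    mttr = mean((end - start).total_seconds() / 3600 for start, end in spans)

    # Time between failures: resolution of one incident to start of the next.
    gaps = [
        (nxt[0] - cur[1]).total_seconds() / 3600
        for cur, nxt in zip(spans, spans[1:])
    ]
    mtbf = mean(gaps) if gaps else None
    return mttr, mtbf


if __name__ == "__main__":
    for svc in ("service-a", "service-b"):
        print(svc, recovery_stats(INCIDENTS, svc))
```

Under these definitions, a service can have a longer mean time to resolve and still fail less often, which is the shape of the ChatGPT versus Claude comparison reported above.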