Large language model-specific inference engines (\emph{LLM inference engines} for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine the official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, commonality, fix effort, fix strategies, and temporal evolution. Our findings reveal six bug symptom types and a taxonomy of 28 root causes, shedding light on the key challenges of bug detection and localization in LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers, along with general guidelines for developing LLM inference engines.