As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems are attracting growing attention for their advantages in flexibility and interpretability over traditional systems. However, despite widespread research interest and industrial adoption, agent systems, like their traditional counterparts, frequently encounter anomalies. These anomalies make the systems unstable and insecure, hindering their further development. A comprehensive and systematic approach to the operation and maintenance of agent systems is therefore urgently needed, yet current research on agent system operations is sparse. To address this gap, we survey agent system operations with the aim of establishing a clear framework for the field, defining its challenges, and facilitating further development. Specifically, we first systematically define anomalies within agent systems, categorizing them into intra-agent and inter-agent anomalies. We then introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps), and provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.