Recent advances in Large Language Models (LLMs) and related techniques such as Retrieval-Augmented Generation (RAG) and Diagram of Thought (DoT) have made it possible to build autonomous intelligent systems that perform cluster diagnostics and troubleshooting. By combining these techniques with self-play methodologies, we have developed an LLM-agent system that autonomously diagnoses and resolves issues in AI clusters. Our contributions include a knowledge base tailored to cluster diagnostics, enhanced LLM algorithms, practical deployment strategies for agents, and a benchmark designed specifically to evaluate LLM capabilities in this domain. Extensive experiments across multiple dimensions show that our system outperforms traditional methods on the challenges of cluster diagnostics, in particular by detecting and rectifying performance issues more efficiently and accurately.