Modern microservice systems have been widely adopted for their scalability, flexibility, and extensibility. However, their independent deployment, decentralization, and frequent dynamic interactions also introduce the risk of cascading failures, which makes accurate failure diagnosis and rapid system recovery challenging. These issues severely impact operational efficiency and user experience. Recognizing the crucial role of failure diagnosis in improving the stability and reliability of microservice systems, researchers have studied the problem extensively and achieved significant results. This survey provides a comprehensive review and preliminary analysis of 94 papers from 2003 to the present, including an overview of the fundamental concepts, a research framework, and problem statements. These insights aim to help researchers understand the latest research progress in failure diagnosis. Publicly available datasets, toolkits, and evaluation metrics are also compiled to help practitioners select and validate techniques, providing a foundation for advancing the domain beyond current practices.