The growing complexity and scale of cloud systems make it difficult for service providers to ensure reliability. This paper highlights two main classes of challenges, internal and external factors, that affect the reliability of cloud microservices. We then discuss how a data-driven approach can address these challenges from four key aspects: ticket management, log management, multimodal analysis, and microservice resilience testing. Our experiments show that the proposed data-driven AIOps solution significantly enhances system reliability from multiple angles.