Microservice resilience, the ability of microservices to recover from failures and continue providing reliable and responsive services, is crucial for cloud vendors. However, current practice relies on manually configured rules specific to a given microservice system, which is labor-intensive and inflexible given the large scale and high dynamics of microservices. A more labor-efficient and versatile solution is desired. Our insight is that a resilient deployment can effectively prevent degradation from disseminating from system performance metrics to user-aware metrics, and the latter directly affect service quality. In other words, failures in a non-resilient deployment can impact both types of metrics, leading to user dissatisfaction. With this in mind, we propose MicroRes, the first versatile resilience profiling framework for microservices via degradation dissemination indexing. MicroRes first injects failures into microservices and collects the available monitoring metrics. Then, it ranks the metrics according to their contributions to the overall service degradation. Finally, it produces a resilience index based on how much of the degradation is disseminated from system performance metrics to user-aware metrics. Higher degradation dissemination indicates lower resilience. We evaluate MicroRes on two open-source microservice systems and one industrial microservice system. The experiments show that MicroRes profiles the resilience of microservices efficiently and effectively. We also showcase the practical usage of MicroRes in production.
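The three steps described above (collect metrics under injected failure, rank them by contribution to degradation, and index how much degradation reaches user-aware metrics) can be illustrated with a minimal sketch. The abstract does not specify the ranking method or the index formula, so everything below is an assumption made for illustration and not MicroRes's actual algorithm: the degradation score is a simple mean shift scaled by the baseline spread, the index is one minus the share of total degradation that falls on user-aware metrics, and the metric names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def degradation_score(baseline, faulty):
    """Absolute mean shift under failure, scaled by baseline spread.

    Illustrative choice only; the abstract does not define the score.
    """
    spread = pstdev(baseline) or 1.0  # fall back to 1.0 for a flat baseline
    return abs(mean(faulty) - mean(baseline)) / spread


def rank_metrics(baseline, faulty):
    """Return (metric, score) pairs sorted from most to least degraded."""
    scores = {
        name: degradation_score(baseline[name], faulty[name])
        for name in baseline
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def resilience_index(ranked, user_aware):
    """Assumed index: 1 minus the share of degradation on user-aware metrics.

    Higher dissemination to user-aware metrics gives a lower index.
    """
    total = sum(score for _, score in ranked)
    if total == 0:
        return 1.0  # no degradation observed at all
    disseminated = sum(score for name, score in ranked if name in user_aware)
    return 1.0 - disseminated / total


# Hypothetical metric samples: one system metric and one user-aware metric.
baseline = {"cpu_percent": [10, 12, 11, 9], "latency_ms": [100, 102, 98, 100]}
# Failure absorbed: CPU spikes, latency seen by users stays flat.
faulty_resilient = {"cpu_percent": [50, 52, 51, 49], "latency_ms": [100, 101, 99, 100]}
# Failure disseminated: CPU spikes and user-facing latency degrades too.
faulty_fragile = {"cpu_percent": [50, 52, 51, 49], "latency_ms": [300, 302, 298, 300]}

user_aware = {"latency_ms"}
index_resilient = resilience_index(rank_metrics(baseline, faulty_resilient), user_aware)
index_fragile = resilience_index(rank_metrics(baseline, faulty_fragile), user_aware)
print(index_resilient, index_fragile)
```

In this sketch the deployment that keeps latency flat under the injected failure scores higher than the one where the same failure also degrades latency, which mirrors the stated relationship that higher degradation dissemination indicates lower resilience.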