Performance debugging in production is a fundamental activity in modern service-based systems. Diagnosing performance issues is often time-consuming because it requires thorough inspection of large volumes of traces and performance indices. In this paper we present DeLag, a novel automated search-based approach for diagnosing performance issues in service-based systems. DeLag identifies subsets of requests that show, in the combination of their Remote Procedure Call (RPC) execution times, symptoms of potentially relevant performance issues. We call such symptoms Latency Degradation Patterns. DeLag searches for multiple latency degradation patterns simultaneously while optimizing precision, recall, and latency dissimilarity. Experiments on 700 datasets of requests generated from two microservice-based systems show that our approach provides better and more stable effectiveness than three state-of-the-art approaches and general-purpose machine learning clustering algorithms. DeLag is more effective than all baseline techniques in at least one case study (with $p \leq 0.05$ and non-negligible effect size). Moreover, on the largest datasets used in our evaluation, DeLag is more efficient than the second and third most effective baseline techniques (by up to 22%).
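To make the notion of a latency degradation pattern more concrete, the following minimal sketch scores one candidate pattern by precision and recall. It is an illustration under stated assumptions, not DeLag's actual formulation: the abstract does not define how patterns are encoded, so the sketch assumes a pattern is a set of intervals over RPC execution times and that a request counts as degraded when its end-to-end latency exceeds a threshold. The RPC names (`auth`, `db_query`), the `slo_ms` parameter, and the sample data are all hypothetical, and latency dissimilarity is omitted because the abstract does not specify it.

```python
from typing import Dict, List, Tuple

# Assumed encoding (not taken from the paper):
# a request maps each RPC name to its execution time in ms, plus a "latency" key.
Request = Dict[str, float]
# a pattern maps an RPC name to a half-open interval [low, high) of execution times.
Pattern = Dict[str, Tuple[float, float]]


def matches(request: Request, pattern: Pattern) -> bool:
    """Return True if every RPC named in the pattern falls inside its interval."""
    return all(low <= request[rpc] < high for rpc, (low, high) in pattern.items())


def precision_recall(
    requests: List[Request], pattern: Pattern, slo_ms: float
) -> Tuple[float, float]:
    """Score a pattern against the requests whose latency exceeds slo_ms."""
    matched = [r for r in requests if matches(r, pattern)]
    degraded = [r for r in requests if r["latency"] > slo_ms]
    hits = [r for r in matched if r["latency"] > slo_ms]
    precision = len(hits) / len(matched) if matched else 0.0
    recall = len(hits) / len(degraded) if degraded else 0.0
    return precision, recall


# Hypothetical sample: two slow requests caused by db_query, one by auth.
requests = [
    {"auth": 5.0, "db_query": 120.0, "latency": 140.0},
    {"auth": 6.0, "db_query": 15.0, "latency": 30.0},
    {"auth": 4.0, "db_query": 110.0, "latency": 125.0},
    {"auth": 50.0, "db_query": 12.0, "latency": 70.0},
]
pattern = {"db_query": (100.0, float("inf"))}
precision, recall = precision_recall(requests, pattern, slo_ms=60.0)
```

In this toy sample the single pattern selects only slow requests but misses the one slowed down by `auth`, which hints at why searching for several patterns at once, as the abstract describes, can be useful.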