Detecting and resolving performance anomalies in Cloud services is crucial for meeting performance objectives. Scaling actions triggered by an anomaly detector help achieve a target latency at the cost of extra resource consumption. However, performance anomaly detectors make mistakes. This paper studies which characteristics of performance anomaly detection matter most for optimizing the trade-off between performance and cost. Using Stochastic Reward Nets, we model a Cloud service monitored by a performance anomaly detector. With this model, we study the impact of three detector characteristics, namely precision, recall, and inspection frequency, on the average latency and resource consumption of the monitored service. Our results show that achieving both high precision and high recall is not always necessary. If detection can be run frequently, high precision is enough to obtain a good performance-to-cost trade-off; if the detector is run infrequently, recall becomes the most important characteristic.
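The interplay between recall and inspection frequency can be illustrated with a back-of-the-envelope calculation. The sketch below is not the Stochastic Reward Net model used in the paper; it is a simplified illustration under stated assumptions: the anomaly starts at a uniformly random point within an inspection interval, persists until it is detected, and is flagged independently with probability equal to the recall at each inspection. The function names are hypothetical.

```python
def expected_detection_delay(interval: float, recall: float) -> float:
    """Mean time from anomaly onset to detection (simplified model).

    Assumptions (not taken from the paper): onset is uniform within an
    inspection interval, the anomaly persists until detected, and each
    inspection detects it independently with probability `recall`.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0.0 < recall <= 1.0:
        raise ValueError("recall must be in (0, 1]")
    # Mean wait until the first inspection is interval / 2; the number of
    # missed inspections before the first hit is geometric with mean
    # (1 - recall) / recall, each costing one further interval.
    return interval * (1.0 / recall - 0.5)


def false_alarms_per_true_alarm(precision: float) -> float:
    """Ratio FP / TP implied by precision = TP / (TP + FP)."""
    if not 0.0 < precision <= 1.0:
        raise ValueError("precision must be in (0, 1]")
    return (1.0 - precision) / precision


if __name__ == "__main__":
    # Compare a frequent and an infrequent detector at two recall levels.
    for interval in (1.0, 60.0):
        for recall in (0.5, 0.9):
            delay = expected_detection_delay(interval, recall)
            print(f"interval={interval:>5} recall={recall}: delay={delay:.2f}")
```

In this simplified model the detection delay scales linearly with the inspection interval, so a low recall costs little when inspections are frequent and a great deal when they are rare. The false-alarm ratio, which drives unnecessary scaling actions, depends only on precision.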