Multi-tenancy in public clouds can cause co-location interference on shared resources, which may degrade the performance of cloud applications. Cloud providers want to know when such events occur and how severe the degradation is so that they can perform interference-aware migrations to alleviate the problem. However, virtual machines (VMs) in Infrastructure-as-a-Service public clouds are black boxes to providers, from which application-level performance information cannot be acquired. This makes performance monitoring highly challenging, as cloud providers can rely only on low-level metrics such as CPU usage and hardware counters. We propose a novel machine learning framework, Alioth, to monitor the performance degradation of cloud applications. To feed the data-hungry models, we first design interference generators and conduct comprehensive co-location experiments on a testbed to build Alioth-dataset, which reflects the complexity and dynamicity of real-world scenarios. We then construct Alioth by (1) augmenting features by using denoising auto-encoders to recover the low-level metrics that would be observed under no interference, (2) devising a transfer learning model based on a domain adaptation neural network so that models generalize to test cases unseen in offline training, and (3) developing a SHAP explainer to automate feature selection and enhance model interpretability. Experiments show that Alioth achieves an average mean absolute error of 5.29% offline and 10.8% when tested on applications unseen in the training stage, outperforming the baseline methods. Alioth is also robust in signaling quality-of-service violations under dynamicity. Finally, we demonstrate a possible application of Alioth's interpretability, providing insights that can inform the decision-making of cloud operators. The dataset and code of Alioth have been released on GitHub.
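To make the reported metric concrete, the following minimal sketch shows how a mean absolute error over predicted performance degradation could be computed, and how a prediction could be turned into a quality-of-service violation signal. It is not Alioth's code: the function names, the convention of expressing degradation as a fraction between 0 and 1, the 0.25 threshold, and all sample values are assumptions chosen purely for illustration.

```python
from typing import List, Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between measured and predicted degradation."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def qos_violations(predicted: Sequence[float], threshold: float) -> List[bool]:
    """Flag each sample whose predicted degradation exceeds the threshold."""
    return [p > threshold for p in predicted]


# Made-up values (hypothetical, not taken from Alioth-dataset):
# degradation is assumed to be a fraction, e.g. 0.30 means 30% slower.
actual = [0.00, 0.10, 0.30, 0.50]
predicted = [0.05, 0.10, 0.20, 0.55]

mae = mean_absolute_error(actual, predicted)
flags = qos_violations(predicted, threshold=0.25)  # threshold is an assumption

print(f"MAE: {mae:.2%}")
print("QoS violation flags:", flags)
```

Under these assumptions, an operator would compare the flags against the violations that actually occurred to judge how reliably the monitor signals them; the threshold itself would depend on the service-level objective in question.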