Cloud services are omnipresent, and critical cloud service failures are a fact of life. To retain customers and prevent revenue loss, it is important to provide high reliability guarantees for these services. One way to do this is to predict outages in advance, which can reduce both their severity and the time to recovery. Forecasting critical failures is difficult because these events are rare. Moreover, critical failures are ill-defined in terms of observable data. Our proposed method, Outage-Watch, defines critical service outages as deteriorations in the Quality of Service (QoS) captured by a set of metrics. Outage-Watch detects such outages in advance by using the current system state to predict whether the QoS metrics will cross a threshold and trigger an extreme event. A mixture of Gaussians is used to model the distribution of the QoS metrics flexibly, and an extreme event regularizer improves learning in the tail of the distribution. An outage is predicted if the probability of any one of the QoS metrics crossing its threshold changes significantly. Our evaluation on a dataset from a real-world SaaS company shows that Outage-Watch significantly outperforms traditional methods, with an average AUC of 0.98. Additionally, when deployed in an enterprise cloud-service system, Outage-Watch detects all the outages that exhibit a change in service metrics and reduces the Mean Time To Detection (MTTD) of outages by up to 88%, demonstrating the efficacy of our proposed method.
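To make the decision rule concrete, the following minimal sketch computes the probability that a QoS metric exceeds its threshold under a mixture of Gaussians, and flags an outage when that probability rises sharply for any metric. It is an illustration under stated assumptions, not the Outage-Watch implementation: the function names, the example mixture parameters, the 500 ms latency threshold, and the `min_jump` rule are all hypothetical. In the method described above the mixture parameters would be predicted from the current system state, whereas here they are fixed by hand.

```python
import math


def gaussian_sf(x, mean, std):
    """Return P(X > x) for X ~ Normal(mean, std**2)."""
    return 0.5 * math.erfc((x - mean) / (std * math.sqrt(2.0)))


def mixture_exceedance(threshold, weights, means, stds):
    """Return P(metric > threshold) under a Gaussian mixture."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(
        w * gaussian_sf(threshold, m, s)
        for w, m, s in zip(weights, means, stds)
    )


def flag_outage(prev_probs, curr_probs, min_jump=0.2):
    """Flag an outage if any metric's exceedance probability rose by
    at least min_jump (a hypothetical stand-in for 'changes significantly')."""
    return any(c - p >= min_jump for p, c in zip(prev_probs, curr_probs))


# Hypothetical latency metric (ms) with an assumed threshold of 500 ms.
THRESHOLD_MS = 500.0

# Assumed mixture parameters at two consecutive time steps.
p_prev = mixture_exceedance(THRESHOLD_MS, [0.9, 0.1], [200.0, 400.0], [50.0, 60.0])
p_curr = mixture_exceedance(THRESHOLD_MS, [0.5, 0.5], [250.0, 520.0], [50.0, 60.0])

print(f"previous exceedance probability: {p_prev:.4f}")
print(f"current exceedance probability:  {p_curr:.4f}")
print("outage predicted:", flag_outage([p_prev], [p_curr]))
```

Monitoring the change in tail probability, rather than the raw metric value, lets the rule react before the threshold is actually crossed, which is the source of the lead time the abstract refers to.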