Because operational data changes continuously, AIOps solutions suffer from performance degradation over time. Periodic retraining is the state-of-the-art technique for preserving the performance of failure prediction AIOps models over time, but it requires a considerable amount of labeled data. In AIOps, obtaining labeled data is expensive, since it requires domain experts to annotate it intensively. In this paper, we present McUDI, a model-centric unsupervised degradation indicator that is capable of detecting the exact moment an AIOps model requires retraining as a result of changes in data. We further show how employing McUDI in the maintenance pipeline of AIOps solutions can reduce the number of samples that require annotation by 30k for job failure prediction and by 260k for disk failure prediction, while achieving performance similar to periodic retraining.