Operational networks commonly rely on machine learning models for many tasks, including detecting anomalies, inferring application performance, and forecasting demand. Yet model accuracy can degrade due to concept drift, whereby the relationship between the features and the target to be predicted changes over time. Mitigating concept drift is an essential part of operationalizing machine learning models in general, but it is of particular importance in networking's highly dynamic deployment environments. In this paper, we first characterize concept drift in a large cellular network for a major metropolitan area in the United States. We find that concept drift occurs across many important key performance indicators (KPIs), independently of the model, training set size, and time interval, which necessitates practical approaches to detect, explain, and mitigate it. We then show that frequent model retraining with newly available data is not sufficient to mitigate concept drift, and can even degrade model accuracy further. Finally, we develop a new methodology for concept drift mitigation, Local Error Approximation of Features (LEAF). LEAF works by detecting drift, explaining which features and time intervals contribute the most to it, and mitigating it using forgetting and over-sampling. We evaluate LEAF against industry-standard mitigation approaches (notably, periodic retraining) with more than four years of cellular KPI data. Our initial tests with a major cellular provider in the US show that LEAF consistently outperforms periodic and triggered retraining on complex, real-world data while reducing costly retraining operations.
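To make the detect-then-mitigate idea concrete, the following is a minimal sketch and not LEAF itself. The abstract does not specify the detector or how forgetting and over-sampling are applied, so everything here is an assumption chosen for illustration: the window-ratio rule in `drift_detected`, the `(timestamp, features, target)` sample layout, and the parameters `window`, `ratio`, `forget_before`, `boost_from`, and `copies`. The explanation step (attributing drift to features and time intervals) is omitted.

```python
from statistics import mean


def drift_detected(errors, window=4, ratio=1.5):
    """Hypothetical detector: flag drift when the mean error of the latest
    `window` observations exceeds `ratio` times the mean error of the
    `window` observations before them. This is an illustrative rule only."""
    if len(errors) < 2 * window:
        return False  # not enough history to compare two windows
    reference = mean(errors[-2 * window:-window])
    recent = mean(errors[-window:])
    return recent > ratio * reference


def rebuild_training_set(samples, forget_before, boost_from, copies=2):
    """Illustrative forgetting and over-sampling.

    samples: list of (timestamp, features, target) tuples.
    Forgetting: drop samples with timestamp < forget_before.
    Over-sampling: repeat samples with timestamp >= boost_from `copies` times.
    """
    rebuilt = []
    for sample in samples:
        timestamp = sample[0]
        if timestamp < forget_before:
            continue  # forget stale data
        repeats = copies if timestamp >= boost_from else 1
        rebuilt.extend([sample] * repeats)
    return rebuilt


if __name__ == "__main__":
    # Toy error series (assumed values): error roughly doubles in the last window.
    errors = [1.0, 1.1, 0.9, 1.0, 2.0, 2.2, 1.9, 2.1]
    samples = [(t, [float(t)], 2.0 * t) for t in range(6)]
    if drift_detected(errors):
        train = rebuild_training_set(samples, forget_before=2, boost_from=4, copies=3)
        print([s[0] for s in train])
```

In this sketch a model would be retrained on the rebuilt set only when the detector fires, which mirrors the abstract's point that mitigation is triggered by drift rather than run on a fixed schedule.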