Ksurf: Attention Kalman Filter and Principal Component Analysis for Prediction under Highly Variable Cloud Workloads

Cloud platforms have become essential for rapidly deploying application systems online to serve large numbers of users. Resource estimation and workload forecasting are critical in cloud data centers. The complexity of the cloud provider environment, driven by varying numbers of virtual machines, introduces high variability in workloads and resource usage, which makes resource prediction difficult for state-of-the-art models that fail to handle nonlinear characteristics. Estimating and predicting the resource metrics of cloud systems across packet networks influenced by unknown external dynamics is a task affected by high measurement noise and variance. A well-suited solution to these problems is the Kalman filter, a variance-minimizing estimator used for system state estimation and efficient, low-latency system state prediction. Kalman filters are optimal estimators for highly variable data with Gaussian state-space characteristics, such as internet workloads. This work makes the following contributions: i) it introduces and evaluates Kalman filter-based model parameter prediction using principal component analysis and an attention mechanism for noisy cloud data; ii) it evaluates the scheme on a Google Cloud benchmark, comparing it to the state-of-the-art Bi-directional Grid Long Short-Term Memory network model on prediction tasks; iii) it applies these techniques to demonstrate accuracy and stability improvements on a real-time messaging system auto-scaler in Apache Kafka. The new scheme improves prediction accuracy by $37\%$ over state-of-the-art Kalman filters in noisy signal prediction tasks. It reduces the prediction error of the neural network model by over $40\%$. It is shown to improve Apache Kafka workload-based scaling stability by $58\%$.
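To make the variance-minimizing role of the Kalman filter concrete, the following is a minimal sketch of a plain scalar Kalman filter applied to a noisy, synthetic load signal. It is not the Ksurf scheme: it uses no principal component analysis and no attention mechanism. The function name `kalman_1d`, the random-walk state model, the noise parameters `q` and `r`, and the synthetic data are all assumptions chosen for illustration.

```python
import random


def kalman_1d(measurements, q=1e-3, r=1.0, x0=0.0, p0=1.0):
    """Scalar Kalman filter under an assumed random-walk state model.

    q: process noise variance, r: measurement noise variance,
    x0/p0: initial state estimate and its variance (all illustrative).
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the random-walk model keeps the state; uncertainty grows by q.
        p = p + q
        # Update: the gain weighs prediction uncertainty against measurement noise.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Synthetic example: a constant "true" load observed with Gaussian noise.
random.seed(0)
true_load = 50.0
noisy = [true_load + random.gauss(0.0, 5.0) for _ in range(200)]
est = kalman_1d(noisy, q=1e-3, r=25.0, x0=noisy[0], p0=25.0)


def mse(values, target):
    """Mean squared error of a sequence against a constant target."""
    return sum((v - target) ** 2 for v in values) / len(values)


mse_raw = mse(noisy[100:], true_load)
mse_est = mse(est[100:], true_load)
```

In this sketch, `mse_est` should come out well below `mse_raw`, which is the sense in which the filter reduces estimation variance. Real cloud workloads are not constant, so the fixed `q` and `r` used here would need to be tuned or adapted to the signal.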