Autoscaling automatically adjusts the resources allocated to applications without human intervention, aiming to guarantee runtime Quality of Service (QoS) while reducing costs. However, user-facing cloud applications serve dynamic workloads that are often highly variable and contain bursts, which makes it challenging for autoscaling to maintain QoS within Service-Level Objectives (SLOs). Conservative strategies risk over-provisioning, while aggressive ones may cause SLO violations, so effective autoscaling is difficult to design. This paper introduces BAScaler, a Burst-Aware Autoscaling framework for containerized cloud services and applications under complex workloads, which combines multi-level machine learning (ML) techniques to mitigate SLO violations while saving costs. BAScaler incorporates a novel prediction-based burst detection mechanism that distinguishes predictable periodic workload spikes from actual bursts. When a burst is detected, BAScaler deliberately overestimates it and allocates resources accordingly to absorb the rapid growth in resource demand. During non-burst periods, BAScaler employs reinforcement learning to correct potential inaccuracies in resource estimation, enabling more precise resource allocation. Experiments across ten real-world workloads demonstrate BAScaler's effectiveness: it achieves a 57% average reduction in SLO violations and cuts resource costs by 10% compared to other prominent methods.
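The idea of separating predictable periodic spikes from actual bursts can be illustrated with a minimal sketch. This is not BAScaler's implementation: the abstract does not specify the predictor, the detection rule, or the overestimation policy. The sketch assumes a seasonal-naive forecast (the value one period earlier), a threshold of `k` standard deviations of past forecast residuals, and a fixed `headroom` multiplier. All function names and parameters here are hypothetical.

```python
from statistics import pstdev


def detect_burst(history, observed, period, k=3.0):
    """Return (is_burst, forecast) for the next observation.

    Assumption: the forecast is seasonal-naive, i.e. the value seen one
    period ago. A point counts as a burst only if it exceeds that forecast
    by more than k standard deviations of past forecast residuals, so a
    spike that recurs every period is predicted and therefore not flagged.
    """
    if len(history) < 2 * period:
        raise ValueError("need at least two full periods of history")
    # Residuals of the seasonal-naive forecast over the known history.
    residuals = [history[i] - history[i - period]
                 for i in range(period, len(history))]
    sigma = pstdev(residuals)
    forecast = history[-period]
    return observed > forecast + k * sigma, forecast


def provision_target(observed, forecast, is_burst, headroom=1.5):
    """Hypothetical allocation rule: overestimate during bursts,
    otherwise provision for the larger of forecast and observation."""
    if is_burst:
        return observed * headroom
    return max(forecast, observed)


# Workload with a recurring spike every 4 steps (request rate, arbitrary units).
history = [10, 12, 50, 11, 11, 10, 52, 10, 10, 11, 49, 12]

# A jump to 40 in a normally quiet slot is treated as a burst ...
burst, forecast = detect_burst(history, observed=40, period=4)
print(burst, provision_target(40, forecast, burst))

# ... while the recurring spike of about 50 is predicted and not flagged.
periodic, forecast = detect_burst(history[:10], observed=49, period=4)
print(periodic, provision_target(49, forecast, periodic))
```

In this sketch the quiet-slot jump is over-provisioned by the headroom factor, while the recurring spike is served from the forecast alone. The reinforcement-learning correction that the abstract describes for non-burst periods is not modeled.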