Resource sharing between multiple workloads has become a prominent practice among cloud service providers, motivated by demand for improved resource utilization and reduced cost of ownership. Effective resource sharing, however, remains an open challenge because resource contention can degrade high-priority, user-facing workloads with strict Quality of Service (QoS) requirements. Although recent approaches have demonstrated promising results, they remain largely impractical in public cloud environments, where workloads are not known in advance and may run only briefly, which rules out offline learning and significantly hinders online learning. In this paper, we propose RAPID, a novel framework for fast, fully online learning of resource allocation policies in highly dynamic operating environments. RAPID leverages lightweight QoS predictions, enabled by domain-knowledge-inspired techniques for sample efficiency and bias reduction, to decouple control from conventional feedback sources and guide policy learning at a rate orders of magnitude faster than prior work. Evaluation on a real-world server platform with representative cloud workloads confirms that RAPID can learn stable resource allocation policies in minutes, compared with hours for the prior state of the art, while improving QoS by 9.0x and increasing best-effort workload performance by 19-43%.