Over the past ten years, many approaches have been proposed for different aspects of resource management for long-running, dynamic, and diverse workloads such as query-stream processing or distributed deep learning. Particularly for applications consisting of containerized microservices, researchers have addressed the dynamic selection of, for example, the types and quantities of virtualized services (e.g., IaaS/VMs), the vertical and horizontal scaling of individual microservices, the assignment of microservices to VMs, task scheduling, or some combination thereof. In this context, we argue that frameworks like simulated annealing are highly suitable for online navigation of trade-offs between performance (SLO) and cost, particularly when the complex workloads and cloud-service offerings vary over time. Based on a macroscopic objective that combines performance and cost terms, annealing facilitates lightweight and coherent policies of exploration and exploitation. In this paper, we first give some background on simulated annealing and then experimentally demonstrate its usefulness in several case studies: service selection for a single type of workload (e.g., distributed deep learning), service selection for a mixture of workload types (exploring a partially categorical set of options), and container sizing for microservice benchmarks. We conclude with a discussion of how the basic annealing platform can be applied to other resource-management problems, hybridized with other methods, and adapted to accommodate user-specified rules of thumb.
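To make the idea of annealing over a combined performance-and-cost objective concrete, the following is a minimal sketch and not the implementation evaluated in the paper. Everything in it is an assumption chosen for illustration: the state is a VM count, the latency model `8.0 / n_vms`, the per-VM cost of 0.1, the SLO of 1.0, the penalty weight, and the geometric cooling schedule are all hypothetical.

```python
import math
import random


def objective(n_vms, slo=1.0, penalty=10.0, unit_cost=0.1):
    """Hypothetical objective: SLO-violation penalty plus resource cost."""
    latency = 8.0 / n_vms              # assumed toy latency model
    violation = max(0.0, latency - slo)
    return penalty * violation + unit_cost * n_vms


def neighbor(n_vms, rng, lo=1, hi=20):
    """Propose a nearby configuration: one VM more or fewer, clamped to [lo, hi]."""
    return min(hi, max(lo, n_vms + rng.choice((-1, 1))))


def anneal(start, steps=500, t0=1.0, cooling=0.95, rng=None):
    """Simulated annealing with a geometric cooling schedule (an assumed choice)."""
    rng = rng or random.Random()
    current, current_val = start, objective(start)
    best, best_val = current, current_val
    temp = t0
    for _ in range(steps):
        cand = neighbor(current, rng)
        cand_val = objective(cand)
        delta = cand_val - current_val
        # Always accept improvements; accept a worse candidate with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = cand, cand_val
            if current_val < best_val:
                best, best_val = current, current_val
        temp = max(temp * cooling, 1e-6)  # floor avoids division by zero
    return best, best_val


best_n, best_obj = anneal(start=1, rng=random.Random(0))
print(best_n, best_obj)
```

At high temperature the search accepts some cost-increasing moves (exploration); as the temperature falls it behaves like greedy descent on the objective (exploitation). In an online setting the objective would be measured from the running system rather than computed from a closed-form model as it is here.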