Allocating resources in a distributed environment is a fundamental challenge. In this paper, we analyze the scheduling and placement of virtual machines (VMs) in the cloud platform of SAP, the world's largest enterprise resource planning software vendor. Based on data from roughly 1,800 hypervisors and 48,000 VMs collected over a 30-day observation period, we highlight potential improvements for workload management. The data was collected through observability tooling that tracks resource usage and performance metrics across the entire infrastructure. In contrast to existing datasets, ours offers fine-grained time-series telemetry of fully virtualized enterprise-level workloads, covering both long-running, memory-intensive SAP S/4HANA systems and diverse general-purpose applications. Our key findings include several suboptimal scheduling situations: CPU resource contention exceeding 40%, CPU ready times of up to 220 seconds, significantly imbalanced compute hosts with a maximum CPU utilization of up to 99% on hosts within the same building block, and overprovisioned CPU and memory resources, with over 80% of VMs using less than 70% of the resources provided to them. Based on these findings, we derive requirements for the design and implementation of novel placement and scheduling algorithms and provide guidance for optimizing resource allocations. We make the full dataset used in this study publicly available to enable data-driven evaluations of scheduling approaches for large-scale cloud infrastructures in future research.