CloudCons: A Comprehensive End-to-End Benchmark for Cloud Resource Consolidation

Because operators over-provision conservatively to guarantee service reliability, resource utilization in cloud data centers remains low. To mitigate this, the forecast-then-optimize paradigm anticipates future demand and uses those forecasts to guide consolidation. Emerging time series foundation models promise to enhance this paradigm through zero-shot generalization, but existing benchmarks focus solely on prediction error metrics. The decision utility of these models therefore remains unverified, leaving their practical value for downstream tasks uncertain. To bridge this gap, we propose CloudCons, a comprehensive end-to-end benchmark that evaluates forecasting models in the specific context of cloud resource consolidation. We build high-quality datasets covering diverse workloads from Huawei Cloud, Microsoft Azure, and Google Borg, which capture distinct service characteristics ranging from synchronized diurnal rhythms to stochastic, pulse-like bursts and high-frequency noise. We conduct an extensive evaluation of statistical, deep learning, and foundation models. Our experiments reveal a pivotal finding: although foundation models demonstrate superior zero-shot forecasting accuracy, this advantage does not inherently translate into better decision utility. We also systematically analyze how the choice of predictive quantile acts as a critical lever, and we provide actionable guidelines for calibrating this choice to balance resource efficiency against service reliability in real-world deployments.
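
To make the role of the predictive quantile concrete, the following is a minimal sketch of a forecast-then-optimize loop, not the CloudCons pipeline itself. It assumes a nearest-rank empirical quantile over past demand as a stand-in for a model's quantile forecast, a first-fit decreasing heuristic as the consolidation step, and a simple count of overloaded host time steps as the reliability measure. All function names, the toy workloads, and the host capacity are hypothetical.

```python
import math


def empirical_quantile(samples, q):
    """Nearest-rank empirical quantile of a list of numbers, for 0 < q <= 1."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def first_fit_decreasing(reservations, capacity):
    """Pack VMs onto hosts by reserved size; returns a list of VM-id lists."""
    hosts = []  # each entry: [remaining_capacity, [vm ids]]
    # Largest reservations first; sorted() is stable, so ties keep input order.
    for vm, size in sorted(reservations.items(), key=lambda kv: -kv[1]):
        if size > capacity:
            raise ValueError(f"VM {vm!r} does not fit on an empty host")
        for host in hosts:
            if host[0] >= size:
                host[0] -= size
                host[1].append(vm)
                break
        else:
            # No existing host had room, so open a new one.
            hosts.append([capacity - size, [vm]])
    return [vms for _, vms in hosts]


def count_violations(placement, future, capacity):
    """Count (host, time step) pairs where realized demand exceeds capacity."""
    violations = 0
    for vms in placement:
        horizon = len(future[vms[0]])
        for t in range(horizon):
            if sum(future[vm][t] for vm in vms) > capacity:
                violations += 1
    return violations


def evaluate(history, future, capacity, q):
    """Reserve the q-quantile per VM, pack, and score against realized demand."""
    reservations = {vm: empirical_quantile(s, q) for vm, s in history.items()}
    placement = first_fit_decreasing(reservations, capacity)
    return len(placement), count_violations(placement, future, capacity)


# Hypothetical toy workloads (integer demand units), not benchmark data.
history = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 4], "c": [2, 2, 2, 2]}
future = {"a": [4, 1], "b": [3, 1], "c": [2, 2]}

for q in (0.5, 1.0):
    hosts_used, violations = evaluate(history, future, capacity=6, q=q)
    print(f"q={q}: hosts={hosts_used}, violations={violations}")
```

On this toy input, reserving the median packs all three VMs onto one host at the cost of an overload, while reserving the maximum uses a second host and avoids it. The heuristic gives no guarantee that host count or violations change monotonically with the quantile, which is one reason the choice needs to be calibrated empirically.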
