Cloud platforms are increasingly relied upon to host diverse, resource-intensive workloads because of their scalability, flexibility, and cost-efficiency. In multi-tenant cloud environments, virtual machines (VMs) are consolidated on shared physical servers to improve resource utilization. While virtualization guarantees resource partitioning for CPU, memory, and storage, it cannot ensure performance isolation. Competition for shared resources such as the last-level cache, memory bandwidth, and network interfaces often leads to severe performance degradation. Existing management techniques, including VM scheduling and resource provisioning, require accurate performance prediction to mitigate this interference. However, such prediction remains challenging in public clouds because VMs are black boxes and workloads are highly dynamic. To address these limitations, we propose CloudFormer, a dual-branch Transformer-based model designed to predict VM performance degradation in black-box environments. CloudFormer jointly models temporal dynamics and system-level interactions, leveraging 206 system metrics at one-second resolution across both static and dynamic scenarios. This design enables the model to capture transient interference effects and adapt to varying workload conditions without scenario-specific tuning. Complementing the methodology, we provide a fine-grained dataset that offers significantly finer temporal resolution and greater metric diversity than existing benchmarks. Experimental results demonstrate that CloudFormer consistently outperforms state-of-the-art baselines across multiple evaluation metrics and generalizes robustly to diverse and previously unseen workloads. Notably, CloudFormer attains a mean absolute error (MAE) of just 7.8%, outperforming existing methods by at least 28%.
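To make the headline metric concrete, the following is a minimal sketch of how a mean absolute error over predicted degradation percentages might be computed, together with a relative improvement over a baseline. The function names and the toy numbers are hypothetical and are not taken from the CloudFormer evaluation. The sketch also assumes that the reported 28% is a relative reduction in MAE, which the abstract does not state explicitly.

```python
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |actual - predicted|, in the same units as the inputs."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("inputs must be non-empty")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_improvement(baseline_error: float, model_error: float) -> float:
    """Fractional error reduction of a model relative to a baseline (assumed definition)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Toy values (hypothetical): performance degradation in percent at four time steps.
actual = [10.0, 25.0, 40.0, 5.0]
model_pred = [12.0, 22.0, 44.0, 6.0]
baseline_pred = [14.0, 20.0, 46.0, 8.0]

model_mae = mean_absolute_error(actual, model_pred)        # errors 2, 3, 4, 1
baseline_mae = mean_absolute_error(actual, baseline_pred)  # errors 4, 5, 6, 3
improvement = relative_improvement(baseline_mae, model_mae)
```

Because the inputs are already percentages, the resulting MAE is expressed in percentage points of degradation, which is one plausible reading of a figure such as 7.8%.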