Resource autoscaling mechanisms in cloud environments depend on accurate performance metrics to make optimal provisioning decisions. When infrastructure faults, including hardware malfunctions, network disruptions, and software anomalies, corrupt these metrics, autoscalers may systematically over- or under-provision resources, resulting in elevated operational expenses or degraded service reliability. This paper conducts controlled simulation experiments to measure how four prevalent fault categories affect both vertical and horizontal autoscaling behavior across multiple instance configurations and service level objective (SLO) thresholds. Experimental findings demonstrate that storage-related faults generate the largest cost overhead, adding up to $258 per month under horizontal scaling policies, whereas routing anomalies consistently bias autoscalers toward insufficient resource allocation. Sensitivity to fault-induced metric distortion differs markedly between scaling strategies: horizontal autoscaling is more susceptible to transient anomalies, particularly near threshold boundaries. These empirically grounded findings yield actionable recommendations for designing fault-tolerant autoscaling policies that distinguish genuine workload fluctuations from failure artifacts.
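To make the failure mode concrete, the sketch below shows a simple proportional, threshold-based horizontal scaling rule (in the spirit of the Kubernetes HPA formula) and how a fault-inflated utilization reading shifts its decision. The function name, target utilization, and numbers are illustrative assumptions for exposition, not values taken from the paper's experiments.

```python
import math


def desired_replicas(current_replicas: int, observed_utilization: float,
                     target_utilization: float = 0.6,
                     min_replicas: int = 1, max_replicas: int = 20) -> int:
    """Proportional horizontal scaling rule: scale the replica count by the
    ratio of observed to target utilization, clamped to configured bounds.
    All parameter values here are hypothetical defaults."""
    ratio = observed_utilization / target_utilization
    return max(min_replicas, min(max_replicas, math.ceil(current_replicas * ratio)))


# Hypothetical example: a storage fault inflates the reported utilization
# from 0.55 to 0.90, pushing the controller from holding steady into
# over-provisioning extra replicas.
print(desired_replicas(10, 0.55))  # -> 10 (near target, no change)
print(desired_replicas(10, 0.90))  # -> 15 (fault-inflated metric adds replicas)
```

Because the rule reacts directly to the most recent metric sample, a transient distortion near the target threshold is enough to flip the decision, which is consistent with the abstract's observation that horizontal autoscaling is especially sensitive near threshold boundaries.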