Reliability in cloud AI infrastructure is crucial for cloud service providers, prompting the widespread use of hardware redundancies. However, these redundancies can inadvertently induce hidden degradation, the so-called "gray failure," in AI workloads, significantly affecting end-to-end performance while concealing the underlying performance issues, which complicates root cause analysis for failures and regressions. We introduce SuperBench, a proactive validation system for AI infrastructure that mitigates hidden degradation caused by hardware redundancies and enhances overall reliability. SuperBench features a comprehensive benchmark suite that can evaluate individual hardware components and represent most real AI workloads. It comprises a Validator, which learns benchmark criteria to clearly pinpoint defective components, and a Selector, which balances validation time against issue-related penalties, enabling optimal timing for validation execution with a tailored subset of benchmarks. Through testbed evaluation and simulation, we demonstrate that SuperBench can increase the mean time between incidents by up to 22.61x. SuperBench has been successfully deployed in Azure production, validating hundreds of thousands of GPUs over the past two years.