Understanding the Effects of Permanent Faults in GPU's Parallelism Management and Control Units

Graphics Processing Units (GPUs) are heavily stressed when accelerating High-Performance Computing applications, and they are also used to accelerate Deep Neural Networks in several domains where they are expected to operate for many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly needed, so that the reliability risk can be estimated and, where possible, mitigated. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most peculiar and stressed resources, along with the first figures that quantify these effects. We characterize over 5.83x10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories to software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these hardware permanent errors impact the running software. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in the parallelism management or control flow induce silent data corruptions.
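To make the software-level step more concrete, the following is a minimal sketch of how a permanent error in thread management might be emulated by instrumenting code and classifying the outcome. It is not the instrumentation used in this work: the toy `vector_add` kernel, the `stuck_at` error model on the thread ID, and the outcome labels `masked`, `SDC`, and `crash` are all assumptions chosen for illustration, and a Python loop stands in for GPU threads.

```python
from typing import Callable, List

Corruption = Callable[[int], int]

def vector_add(tid: int, a: List[int], b: List[int], out: List[int]) -> None:
    """Toy stand-in for a GPU kernel: each thread adds one element."""
    out[tid] = a[tid] + b[tid]

def stuck_at(bit: int, value: int) -> Corruption:
    """Hypothetical permanent error: force one bit of the thread ID to `value`."""
    mask = 1 << bit
    if value:
        return lambda tid: tid | mask
    return lambda tid: tid & ~mask

def run(a: List[int], b: List[int], corrupt: Corruption = lambda tid: tid) -> List[int]:
    """Execute the kernel for every thread; the error is applied on every call,
    which is what distinguishes a permanent error from a transient one."""
    out = [0] * len(a)
    for tid in range(len(a)):
        vector_add(corrupt(tid), a, b, out)
    return out

def classify(a: List[int], b: List[int], corrupt: Corruption) -> str:
    """Compare a faulty run against the fault-free (golden) run."""
    golden = run(a, b)
    try:
        faulty = run(a, b, corrupt)
    except IndexError:
        return "crash"      # corrupted ID points outside the data
    return "masked" if faulty == golden else "SDC"

a = list(range(8))
b = [10] * 8
results = {
    "bit0 stuck-at-1": classify(a, b, stuck_at(0, 1)),
    "bit3 stuck-at-1": classify(a, b, stuck_at(3, 1)),
    "bit5 stuck-at-0": classify(a, b, stuck_at(5, 0)),
}
```

In this toy setting, a stuck bit in the low part of the thread ID makes some threads redo another thread's work and leaves output elements unwritten, which shows up as a silent data corruption. A stuck-at-1 bit above the valid ID range drives every access out of bounds, while a stuck-at-0 bit that is never set in a valid ID has no visible effect.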
