Graphics Processing Units (GPUs) are heavily stressed when accelerating High-Performance Computing applications, and they are also used to accelerate Deep Neural Networks in several domains where they are expected to operate for many years. These conditions expose the GPU hardware to (premature) aging, causing permanent faults to arise after the usual end-of-manufacturing test. Techniques to assess the impact of permanent faults in GPUs are therefore strongly needed, so that the reliability risk can be estimated and, where possible, mitigated. In this paper, we present a method to evaluate the effects of permanent faults affecting the GPU scheduler and control units, which are the most distinctive and most stressed resources, along with the first figures that quantify these effects. We characterize over 5.83 × 10^5 permanent fault effects in the scheduler and controllers of a gate-level GPU model. Then, we map the observed error categories to software by instrumenting the code of 13 applications and two convolutional neural networks, injecting more than 1.65 × 10^5 permanent errors. Our two-level fault injection strategy reduces the evaluation time from hundreds of years of gate-level evaluation to hundreds of hours. We found that faults in the GPU parallelism management units can modify the opcode, the addresses, and the status of thread(s) and warp(s). The large majority (up to 99%) of these permanent hardware errors affect the execution of the running software. Errors affecting the instruction operation or resource management hang the code, while 45% of errors in parallelism management or control flow induce silent data corruptions.
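The software-level half of such a two-level strategy can be illustrated with a toy example. The sketch below is not the instrumentation used in this work; it is a minimal Python model, under stated assumptions, of how a permanent error observed at gate level (a corrupted thread status or a corrupted opcode) might be replayed in software and its outcome classified. The names `vector_kernel`, `classify`, `dead_lane`, `forced_opcode`, and the warp width of 8 are all hypothetical, and a hang is approximated here by an exception on an undefined opcode.

```python
from enum import Enum

WARP_SIZE = 8  # hypothetical warp width, kept small for illustration

# Toy instruction set: only these opcodes are defined.
OPS = {"ADD": lambda x, y: x + y, "SUB": lambda x, y: x - y}


class Outcome(Enum):
    MASKED = "masked"
    SDC = "silent data corruption"
    DUE = "hang or crash"


def vector_kernel(a, b, opcode="ADD", dead_lane=None, forced_opcode=None):
    """Toy SIMT kernel: one element per thread, threads grouped in warps.

    dead_lane     -- assumed thread-status error: this lane is never active.
    forced_opcode -- assumed opcode error: every instruction decodes to this.
    Both errors are permanent, so they apply to every warp, not just one.
    """
    out = [0] * len(a)
    for i in range(len(a)):
        lane = i % WARP_SIZE
        if lane == dead_lane:
            continue  # the disabled thread never writes its result
        op = forced_opcode if forced_opcode is not None else opcode
        out[i] = OPS[op](a[i], b[i])  # raises KeyError on an undefined opcode
    return out


def classify(a, b, **error):
    """Compare a run with an injected error against the fault-free run."""
    golden = vector_kernel(a, b)
    try:
        faulty = vector_kernel(a, b, **error)
    except KeyError:
        return Outcome.DUE  # stands in for a hang in this toy model
    return Outcome.MASKED if faulty == golden else Outcome.SDC
```

With `a = list(range(16))` and `b = [1] * 16`, `classify(a, b, dead_lane=3)` returns `Outcome.SDC` because lane 3 of each warp leaves its output at zero, while `classify(a, b, forced_opcode="ILLEGAL")` returns `Outcome.DUE`. An all-zero input with the same dead lane is classified as masked, which shows why outcome rates depend on the application and its data.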