We find that a single GPU kernel can have both idempotent and non-idempotent instances, depending on its input. These kernels, called conditionally-idempotent, are prevalent in real-world GPU applications (490 of 547 kernels across six applications). Consequently, prior work that classifies each GPU kernel as a whole as either idempotent or non-idempotent can severely compromise the correctness or efficiency of idempotence-based systems. This paper presents PICKER, the first system for instance-level idempotency validation. PICKER dynamically validates the idempotency of each GPU kernel instance before it executes, using the instance's launch arguments. Several optimizations reduce validation latency to the microsecond scale. Evaluations on representative GPU applications (547 kernels and 18,217 instances in total) show that PICKER identifies idempotent instances with no false positives and a false-negative rate of 18.54%, and completes validation within 5 µs for every instance. Furthermore, integrating PICKER allows a fault-tolerant system to reduce checkpoint cost to less than 4% and a scheduling system to reduce preemption latency by 84.2%.
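To make the notion of a conditionally-idempotent kernel concrete, the following is a minimal sketch in Python, not PICKER's actual algorithm. It assumes a toy "kernel" that adds one to each element, and a hypothetical `BufferArg` record describing the address range that each launch argument may read or write. The check is deliberately conservative: it may reject an instance that is in fact idempotent (a false negative), but it accepts an instance only when no written range overlaps a read range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BufferArg:
    """Hypothetical summary of one pointer argument of a kernel launch."""
    base: int     # start address (here: an index into a flat memory list)
    length: int   # number of elements the kernel may touch from base
    reads: bool   # the kernel may read this range
    writes: bool  # the kernel may write this range


def ranges_overlap(a: BufferArg, b: BufferArg) -> bool:
    """True if the half-open ranges [base, base + length) intersect."""
    return a.base < b.base + b.length and b.base < a.base + a.length


def instance_is_idempotent(args: list[BufferArg]) -> bool:
    """Conservative check: accept only if no written range overlaps a read range."""
    for w in args:
        if not w.writes:
            continue
        for r in args:
            if r.reads and ranges_overlap(w, r):
                return False
    return True


def add_one(mem: list[int], src: int, dst: int, n: int) -> None:
    """Toy stand-in for a GPU kernel: mem[dst + i] = mem[src + i] + 1."""
    for i in range(n):
        mem[dst + i] = mem[src + i] + 1


# Same kernel, two instances that differ only in their launch arguments.
out_of_place = [BufferArg(0, 4, True, False), BufferArg(4, 4, False, True)]
in_place = [BufferArg(0, 4, True, False), BufferArg(0, 4, False, True)]

# Out-of-place instance: running it twice gives the same memory as running it once.
once = [1, 2, 3, 4, 0, 0, 0, 0]
add_one(once, 0, 4, 4)
twice = list(once)
add_one(twice, 0, 4, 4)

# In-place instance: a second run changes the result, so re-execution is unsafe.
once_ip = [1, 2, 3, 4]
add_one(once_ip, 0, 0, 4)
twice_ip = list(once_ip)
add_one(twice_ip, 0, 0, 4)
```

In this sketch, the out-of-place launch passes the check and the in-place launch fails it, even though both run the same kernel code. That is the situation in which a per-kernel classification would be either unsafe or overly pessimistic.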