Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs. We present ITHICA, an approach for automatically generating functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies the affected instructions when an error is detected. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. Within the ITHICA tests derived from our baseline programs, ITHICA error checks detect 39% more defective servers than the programs' native checks, and they enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
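The duplication-and-comparison idea can be illustrated with a small software-level analogy. The sketch below is not ITHICA itself, which inserts checks at the instruction level. It only assumes that an operation can be executed twice on the same inputs and its two outputs compared. All names (`checked`, `make_flaky_add`, `InconsistentResultError`) are hypothetical, and the "defect" is simulated by flipping one bit on chosen call numbers.

```python
from typing import Callable, Set


class InconsistentResultError(Exception):
    """Raised when two executions of the same operation disagree."""


def checked(op: Callable[..., int], *args: int) -> int:
    """Run `op` twice on identical inputs and compare the two outputs."""
    first = op(*args)
    second = op(*args)  # duplicate execution with the same inputs
    if first != second:
        # The mismatch also tells us which operation was affected.
        raise InconsistentResultError(
            f"{op.__name__}{args}: {first} != {second}"
        )
    return first


def healthy_add(a: int, b: int) -> int:
    return a + b


def make_flaky_add(faulty_calls: Set[int]) -> Callable[[int, int], int]:
    """Hypothetical stand-in for a defective adder.

    The returned function flips bit 0 of its result on the call numbers
    listed in `faulty_calls` (1-based), simulating a context-dependent error.
    """
    count = 0

    def flaky_add(a: int, b: int) -> int:
        nonlocal count
        count += 1
        result = a + b
        if count in faulty_calls:
            result ^= 1  # simulated defect: corrupt the lowest bit
        return result

    return flaky_add
```

In this toy model, an error that appears in only one of the two executions is caught by the comparison, whereas an error that corrupts both executions identically would pass unnoticed. This is why the approach relies on defects that produce inconsistent errors.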