ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

Hyperscaler reports of silent data corruptions (SDCs), presumed to be caused by silicon manufacturing defects, have motivated the development of functional tests for detecting defective CPUs. We present ITHICA, an approach for automatically generating functional tests for defect-induced errors from arbitrary programs by inserting intra-thread, instruction-level error checks, primarily leveraging instruction duplication and output comparison. Our key insight is that the most pernicious defects cause inconsistent errors: two executions of the same instruction within the same thread, given the same inputs, can produce different architectural outputs depending on the execution context in which they run. By exploiting this insight, ITHICA enables arbitrary programs to serve as tests and identifies affected instructions upon error detections. We use ITHICA to transform industrial hyperscaler test programs (our baseline), datacenter workloads, and common libraries into functional tests, and evaluate them on over 3,000 CPU servers. ITHICA error checks detect 39% more defective servers than native checks within the ITHICA tests derived from our baseline programs, and enable novel findings on defect behavior that challenge conclusions drawn by prior hyperscaler fleet studies.
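The duplicate-and-compare idea in the abstract can be illustrated with a small software-level analogy. This is only a sketch under stated assumptions, not ITHICA's actual instrumentation, which inserts checks at the machine-instruction level. The names `FlakyAdder`, `checked`, `run`, and `InconsistentResult` are hypothetical, and the "defect" is simulated by flipping one bit on every `period`-th call so that the same operation with the same inputs gives different outputs depending on when it runs.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


class InconsistentResult(Exception):
    """Raised when two executions of the same operation disagree."""


@dataclass
class FlakyAdder:
    """Hypothetical stand-in for a defective unit (an assumption for illustration).

    Every `period`-th call flips bit 0 of the result, mimicking an error that
    depends on execution context rather than on the inputs.
    """
    period: int = 3
    calls: int = 0

    def add(self, a: int, b: int) -> int:
        self.calls += 1
        result = a + b
        if self.calls % self.period == 0:
            result ^= 1  # simulated context-dependent corruption
        return result


def checked(name: str, op: Callable[[int, int], int], a: int, b: int) -> int:
    """Run `op` twice on identical inputs in the same thread and compare outputs."""
    first = op(a, b)
    second = op(a, b)  # the duplicate execution
    if first != second:
        # Report which operation disagreed, analogous to identifying
        # the affected instruction when an error is detected.
        raise InconsistentResult(f"{name}({a}, {b}): {first} != {second}")
    return first


def run(adder: FlakyAdder, values: Iterable[int]) -> int:
    """An 'arbitrary program' (a running sum) with every add wrapped in a check."""
    total = 0
    for v in values:
        total = checked("add", adder.add, total, v)
    return total
```

With a very large `period` the simulated unit never misbehaves and `run` returns the plain sum. With `period=3` the third call is corrupted, the two executions of the same add disagree, and `checked` raises `InconsistentResult` naming the operation. A check of this kind only catches inconsistent errors: a defect that corrupts both executions identically would pass unnoticed.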

