Heterogeneous parallel error detection is an approach to building fault-tolerant processors: multiple power-efficient cores re-execute software originally run on a high-performance core. However, its complex components, which gather data across the chip from many parts of the core, raise the question of how to build it into commodity cores without invasive design changes and extensive re-engineering. We build the first full-RTL design, MEEK, into an open-source SoC, spanning the microarchitecture and ISA up to the OS and programming model. We identify and solve bottlenecks and bugs overlooked in previous work, and demonstrate that MEEK offers microsecond-level detection capability with affordable overheads. By trading off architectural functionality across co-designed hardware and software layers, MEEK requires only light changes to a mature out-of-order superscalar core, simple coordinating software layers, and a few lines of operating-system code. MEEK's source code is available at https://github.com/SEU-ACAL/reproduce-MEEK-DAC-25.