With shrinking technology nodes and the use of parallel processor clusters in hostile, critical environments such as space, run-time faults caused by radiation have become a serious cross-cutting concern that also affects architectural design. This paper introduces an architectural approach to run-time configurable soft-error tolerance at the core level, augmenting a six-core open-source RISC-V cluster with a novel On-Demand Redundancy Grouping (ODRG) scheme. ODRG allows the cluster to operate either as two fault-tolerant cores or as six individual cores for high performance, with limited overhead when switching between these modes at run time. The ODRG unit adds less than 11% of a core's area for a three-core group, or a total of 1% of the cluster area, and shows a negligible timing increase. This compares favorably to a commercial state-of-the-art implementation, and ODRG is 2.5$\times$ faster in fault-recovery re-synchronization. Furthermore, when redundancy is not needed, the ODRG approach allows the redundant cores to be used for independent computation, allowing up to a 2.96$\times$ increase in performance for selected applications.
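The two operating modes can be illustrated with a small behavioral model. This is a minimal sketch, not the ODRG hardware: it assumes that each three-core group resolves disagreements by bitwise 2-of-3 majority voting, which the abstract does not state explicitly, and the names `Mode`, `majority_vote`, `vote_and_flag`, and `effective_cores` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    INDEPENDENT = "independent"  # six cores, each running its own work
    REDUNDANT = "redundant"      # cores grouped in threes, each group acting as one core


def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority (assumed voting scheme, not taken from the paper)."""
    return (a & b) | (a & c) | (b & c)


def vote_and_flag(outputs: tuple[int, int, int]) -> tuple[int, list[int]]:
    """Return the voted value and the indices of cores that disagree with it."""
    voted = majority_vote(*outputs)
    mismatching = [i for i, out in enumerate(outputs) if out != voted]
    return voted, mismatching


def effective_cores(mode: Mode, num_cores: int = 6, group_size: int = 3) -> int:
    """Number of logical cores visible to software in each mode."""
    if mode is Mode.INDEPENDENT:
        return num_cores
    return num_cores // group_size


if __name__ == "__main__":
    # Core 2 suffers a single-bit upset; the other two cores outvote it.
    value, bad = vote_and_flag((0b1010, 0b1010, 0b0010))
    print(bin(value), bad)
    print(effective_cores(Mode.REDUNDANT), effective_cores(Mode.INDEPENDENT))
```

In this model, a mismatching core index is what would trigger re-synchronization in redundant mode, while independent mode exposes all six cores and forgoes the voting step.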