Space Cyber-Physical Systems (S-CPS), such as spacecraft and satellites, depend heavily on reliable onboard computers to guarantee mission success. Relying solely on radiation-hardened technologies is extremely expensive, and introducing modular redundancy through inflexible architectural and microarchitectural modifications leads to significant area increase and performance degradation. To mitigate the overheads of traditional radiation hardening and modular redundancy, we present Hybrid Modular Redundancy (HMR), a novel redundancy scheme for a cluster of RISC-V processors that groups computing cores on demand into dual-core or triple-core lockstep, with runtime split-lock capabilities. We further propose two recovery approaches, one software-based and one hardware-based, that trade off performance against area overhead. Running at 430 MHz, our fault-tolerant cluster achieves up to 1160 MOPS on a matrix multiplication benchmark in non-redundant mode, and 617 and 414 MOPS in dual and triple mode, respectively. In triple mode, software-based recovery requires 363 clock cycles, and the software-based recovery variant occupies 0.612 mm², a 1.3% area overhead over a non-redundant 12-core RISC-V cluster. As a high-performance alternative, a new hardware-based method recovers from a fault in just 24 clock cycles and occupies 0.660 mm², an area overhead of ~9.4% over the baseline non-redundant RISC-V cluster. The split-lock capability lets the cluster enter and exit a redundant mode with minimal performance loss, so that it can switch between mission-critical and performance-oriented sections of execution with fewer than 400 clock cycles of overhead for entry and exit. The proposed system is the first to integrate these functionalities on an open-source RISC-V-based compute device, enabling finely tunable reliability vs. performance trade-offs.
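To make the dual-core and triple-core lockstep idea concrete, the following is a minimal software sketch of the two checks such grouping implies: a bitwise 2-of-3 majority vote for triple mode, which can mask a single faulty core, and a plain comparison for dual mode, which can only detect a mismatch. It is an illustration only, not the hardware described in the abstract. The function names `tmr_vote` and `dmr_check` and the 32-bit output width are assumptions made for this example.

```python
from typing import Optional, Tuple

WORD_MASK = 0xFFFFFFFF  # assumption: each core output is a 32-bit word


def tmr_vote(a: int, b: int, c: int) -> Tuple[int, Optional[int]]:
    """Bitwise 2-of-3 majority vote over three lockstepped core outputs.

    Returns (voted_word, faulty_core). faulty_core is the index (0, 1 or 2)
    of the single core that disagrees with the voted word. It is None when
    all cores agree, or when more than one core disagrees and the fault
    cannot be attributed to a single core.
    """
    voted = ((a & b) | (a & c) | (b & c)) & WORD_MASK
    disagreeing = [
        i for i, word in enumerate((a, b, c)) if (word & WORD_MASK) != voted
    ]
    faulty = disagreeing[0] if len(disagreeing) == 1 else None
    return voted, faulty


def dmr_check(a: int, b: int) -> bool:
    """Return True when two lockstepped outputs match (no fault detected)."""
    return (a & WORD_MASK) == (b & WORD_MASK)


if __name__ == "__main__":
    golden = 0xDEADBEEF
    upset = golden ^ (1 << 7)  # model a single-bit upset in one core
    voted, faulty = tmr_vote(golden, upset, golden)
    print(hex(voted), faulty)        # voted word and index of the faulty core
    print(dmr_check(golden, upset))  # dual mode only reports the mismatch
```

In this sketch, triple mode both corrects the output and identifies which core needs recovery, whereas dual mode can only flag that the two outputs differ. This matches the general distinction between fault masking and fault detection in modular redundancy.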