Single-event upset (SEU) fault tolerance for systems-on-chip (SoCs) in radiation-heavy environments is often addressed with architectural fault-tolerance approaches that protect individual SoC components (e.g., cores, memories) in isolation. However, protecting the voting logic and the interconnections among components is also critical, as these otherwise become single points of failure in the design. We investigate combining multiple fault-tolerance approaches that target individual SoC components, including interconnect and voting logic, to ensure end-to-end SoC-level architectural SEU fault tolerance while minimizing implementation area overhead. Enforcing an overlap between the protection methods ensures that the whole design is hardened without gaps while curtailing overheads. We demonstrate our approach on a RISC-V microcontroller SoC. SEU fault tolerance is assessed with simulation-based fault injection, and overheads are assessed with a full physical implementation. We demonstrate tolerance to over 99.9% of faults in both the RTL and the implemented netlist. Furthermore, the design exhibits 22% lower implementation overhead than a single global fault-tolerance method such as fine-grained triplication.
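As a purely illustrative sketch (not the design or tooling used in this work), the following Python snippet models triplication with a bitwise majority voter and a toy fault-injection campaign. The names `majority`, `flip_bit`, and `inject_campaign` are hypothetical, and the model assumes a single bit flip in one replica per trial. It also shows why an unprotected voter output remains a single point of failure.

```python
import random


def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote over three redundant copies."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


def inject_campaign(value: int, width: int = 32, trials: int = 1000, seed: int = 0) -> float:
    """Return the fraction of single-replica, single-bit upsets that the voter masks."""
    rng = random.Random(seed)
    masked = 0
    for _ in range(trials):
        replicas = [value, value, value]
        victim = rng.randrange(3)  # which replica is hit (assumption: exactly one)
        replicas[victim] = flip_bit(replicas[victim], rng.randrange(width))
        if majority(*replicas) == value:
            masked += 1
    return masked / trials


# An upset on the voter's own output is not masked by the voter itself.
voter_output_hit = flip_bit(majority(5, 5, 5), 0)
```

In this toy model every upset confined to one replica is masked, whereas `voter_output_hit` differs from the correct value, which is the kind of gap that overlapping protection of voters and interconnect is meant to close.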