Rapid CMOS scaling has put billions of transistors on a chip and enabled the integration of many cores, which brings challenges such as increased power and thermal dissipation and the occurrence of transient and permanent faults. Mitigating transient and permanent faults at the core level has become an important design consideration in multi-core systems. Core-level techniques are redundancy-based fault mitigation techniques that improve the lifetime reliability of multi-core systems. One such technique, in which the smaller cores of an asymmetric multi-core system provide fault tolerance to the larger cores, has gained momentum and attention from many researchers. This paper presents an economical, asymmetric multi-core system with one-instruction cores (MCSOIC). Hardware cost estimation here means estimating the power and area of MCSOIC. In MCSOIC, a one-instruction core (OIC) is a warm standby redundant core that provides functional support to conventional cores for short periods of time. To evaluate the idea, different configurations of MCSOIC are synthesized for FPGA and ASIC. The maximum power overhead and maximum area overhead are 0.46% and 11.4%, respectively. For reliability analysis, the behavior of OICs in MCSOIC is modelled using a One-Shot System (OSS) model. The OSS model parameters, namely readiness, wakeup probability, and start-up strategy, are mapped to multi-core systems with OICs. Expressions for system reliability are derived, and system reliability is estimated for special cases.
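As a rough illustration of how readiness and wakeup probability could enter a reliability estimate, the sketch below uses a generic standby formula with imperfect takeover. It is not the expression derived in the paper. The function name, its parameters, the single-core-plus-single-OIC setup, and the assumption that all events are independent are hypothetical simplifications, and the start-up strategy and any time dependence are ignored.

```python
def standby_reliability(r_core: float, r_oic: float,
                        readiness: float, p_wakeup: float) -> float:
    """Mission success probability for one conventional core backed by one OIC.

    Assumed model (not taken from the paper): the system succeeds if the
    conventional core survives, or if it fails and the standby OIC is ready,
    wakes up, and itself survives. All events are treated as independent.
    """
    for name, p in (("r_core", r_core), ("r_oic", r_oic),
                    ("readiness", readiness), ("p_wakeup", p_wakeup)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1]")
    # Probability that the OIC successfully takes over after a core failure.
    takeover = readiness * p_wakeup * r_oic
    return r_core + (1.0 - r_core) * takeover
```

Under these assumptions, a standby that never wakes up (`p_wakeup = 0`) leaves the system reliability equal to that of the conventional core alone, while a perfectly ready and reliable standby raises it to 1.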