Radiation-induced soft errors are among the most challenging reliability issues in Safety-Critical Real-Time Embedded Systems (SACRES) and are usually handled with different flavors of Double Modular Redundancy (DMR). Across all domains, the growing complexity of modern microprocessors is making this solution unaffordable. This paper addresses the promising field of Artificial Intelligence (AI) based hardware detectors for soft errors. To build such detector cores and keep them general enough to work with different software applications, microarchitectural attributes are attractive candidate fault-detection features. Several processors already track these attributes through a dedicated Performance Monitoring Unit (PMU). However, it remains an open question to what extent they suffice to detect faulty executions. Exploiting the capability of gem5 to simulate real computing systems, perform fault-injection experiments, and profile microarchitectural attributes (i.e., gem5 Stats), this paper presents a comprehensive analysis of the attributes that can potentially detect soft errors and of the models that can be trained with these features.
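To make the idea of using gem5 Stats as fault-detection features concrete, the following is a minimal sketch and not the method evaluated in the paper. It assumes the usual plain-text `stats.txt` layout (one `name value # description` entry per line between "Begin" and "End" marker lines) and compares a fault-free ("golden") run against an observed run. The function names, the 5% threshold, and the stat names in the usage note are illustrative assumptions; a trained model would replace the fixed threshold.

```python
import re

# One scalar stat per line: "<name> <numeric value> # description".
# Non-numeric values (e.g. "nan") and the Begin/End marker lines do not match.
STAT_LINE = re.compile(r"^(\S+)\s+(-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)\b")


def parse_gem5_stats(text):
    """Return {stat_name: float} for the first statistics dump in `text`."""
    stats = {}
    for line in text.splitlines():
        if line.startswith("---------- End"):
            break  # only the first dump is used in this sketch
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


def relative_deviation(golden, observed):
    """Per-attribute |observed - golden| / |golden| for shared attributes."""
    deviation = {}
    for name, golden_value in golden.items():
        if name in observed:
            # Assumption: fall back to an absolute difference when golden is 0.
            denom = abs(golden_value) if golden_value != 0 else 1.0
            deviation[name] = abs(observed[name] - golden_value) / denom
    return deviation


def flag_run(golden, observed, threshold=0.05):
    """Return the sorted attribute names whose deviation exceeds `threshold`."""
    deviation = relative_deviation(golden, observed)
    return sorted(name for name, d in deviation.items() if d > threshold)
```

For example, with a golden dump reporting `system.cpu.numCycles 2000` and an observed dump reporting `system.cpu.numCycles 2300` (illustrative names and values), `flag_run(parse_gem5_stats(golden_text), parse_gem5_stats(observed_text))` would list that attribute as deviating. Which attributes are actually informative for soft-error detection is the question the paper's analysis addresses.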