The reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, such as silent data corruptions (SDC), that can corrupt model parameters. When this occurs during AI inference/serving, it can lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of this escalating threat, it is crucial to address two key questions: how vulnerable are AI models to parameter corruptions, and how do different components of a model (such as modules and layers) differ in their vulnerability to parameter corruptions? To address these questions systematically, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, aiming to standardize the quantification of AI model vulnerability to parameter corruptions. We define a model parameter's PVF as the probability that a corruption of that particular parameter will result in an incorrect output. In this paper, we present several use cases applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT) -- along with an in-depth vulnerability analysis of DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, such as by mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
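The PVF definition above lends itself to a Monte Carlo estimate: repeatedly flip a random bit in a randomly chosen parameter, rerun inference, and measure the fraction of corruptions that change the model's output. The sketch below illustrates this idea on a toy linear classifier; it is a minimal illustration under assumed details, not the paper's actual fault-injection setup, and the model, data, and helper names (`flip_random_bit`, `estimate_pvf`) are hypothetical.

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Flip one random bit in the IEEE-754 double representation of value."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def predict(weights, x):
    """Toy linear classifier: returns the argmax over per-class scores."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=lambda k: scores[k])

def estimate_pvf(weights, dataset, trials=1000, seed=0):
    """Monte Carlo PVF estimate: the fraction of single-bit parameter
    corruptions that change the model's output on the given dataset."""
    random.seed(seed)
    golden = [predict(weights, x) for x in dataset]  # fault-free outputs
    mismatches = 0
    for _ in range(trials):
        i = random.randrange(len(weights))
        j = random.randrange(len(weights[i]))
        saved = weights[i][j]
        weights[i][j] = flip_random_bit(saved)  # inject the fault
        if any(predict(weights, x) != g for x, g in zip(dataset, golden)):
            mismatches += 1
        weights[i][j] = saved  # restore the parameter
    return mismatches / trials

# Hypothetical 2-class, 3-feature model and a tiny evaluation set.
weights = [[0.2, -0.5, 1.0], [0.7, 0.1, -0.3]]
dataset = [[1.0, 0.0, 0.5], [0.2, 0.9, -0.1]]
pvf = estimate_pvf(weights, dataset)
print(f"estimated PVF: {pvf:.3f}")
```

Restricting the injection site `(i, j)` to a single module or layer yields that component's PVF, which is how the per-component vulnerability comparisons described above could be computed.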