The reliability of AI systems is a fundamental concern for the successful deployment and widespread adoption of AI technologies. Unfortunately, the escalating complexity and heterogeneity of AI hardware systems make them increasingly susceptible to hardware faults, such as silent data corruptions (SDC), that can corrupt model parameters. When this occurs during AI inference/serving, it can lead to incorrect or degraded model output for users, ultimately affecting the quality and reliability of AI services. In light of this escalating threat, it is crucial to address two key questions: how vulnerable are AI models to parameter corruptions, and how do different components of a model (such as modules and layers) differ in their vulnerability to parameter corruptions? To address these questions systematically, we propose a novel quantitative metric, the Parameter Vulnerability Factor (PVF), inspired by the architectural vulnerability factor (AVF) from the computer architecture community, aiming to standardize the quantification of AI model vulnerability to parameter corruptions. We define a model parameter's PVF as the probability that a corruption of that particular parameter will result in an incorrect output. In this paper, we present several use cases applying PVF to three types of tasks/models during inference -- recommendation (DLRM), vision classification (CNN), and text classification (BERT) -- along with an in-depth vulnerability analysis of DLRM. PVF can provide pivotal insights to AI hardware designers in balancing the tradeoff between fault protection and performance/efficiency, such as by mapping vulnerable AI parameter components to well-protected hardware modules. The PVF metric is applicable to any AI model and has the potential to help unify and standardize AI vulnerability/resilience evaluation practice.
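The PVF definition above lends itself to a Monte Carlo estimate: repeatedly flip a random bit in a randomly chosen parameter, rerun inference, and measure the fraction of corruptions that change the model's output. The sketch below illustrates this idea on a toy linear classifier; it is a minimal illustration under assumed details, not the paper's actual fault-injection setup, and the model, data, and helper names (`flip_random_bit`, `estimate_pvf`) are hypothetical.

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Flip one random bit in the IEEE-754 double representation of value."""
    bits = struct.unpack("<Q", struct.pack("<d", value))[0]
    bits ^= 1 << random.randrange(64)
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def predict(weights, x):
    """Toy linear classifier: returns the argmax over per-class scores."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=lambda k: scores[k])

def estimate_pvf(weights, dataset, trials=1000, seed=0):
    """Monte Carlo PVF estimate: the fraction of single-bit parameter
    corruptions that change the model's output on the given dataset."""
    random.seed(seed)
    golden = [predict(weights, x) for x in dataset]  # fault-free outputs
    mismatches = 0
    for _ in range(trials):
        i = random.randrange(len(weights))
        j = random.randrange(len(weights[i]))
        saved = weights[i][j]
        weights[i][j] = flip_random_bit(saved)  # inject the fault
        if any(predict(weights, x) != g for x, g in zip(dataset, golden)):
            mismatches += 1
        weights[i][j] = saved  # restore the parameter
    return mismatches / trials

# Hypothetical 2-class, 3-feature model and a tiny evaluation set.
weights = [[0.2, -0.5, 1.0], [0.7, 0.1, -0.3]]
dataset = [[1.0, 0.0, 0.5], [0.2, 0.9, -0.1]]
pvf = estimate_pvf(weights, dataset)
print(f"estimated PVF: {pvf:.3f}")
```

Restricting the injection site `(i, j)` to a single module or layer yields that component's PVF, which is how the per-component vulnerability comparisons described above could be computed.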