A Survey on Failure Analysis and Fault Injection in AI Systems

The rapid advancement of Artificial Intelligence (AI) has led to its integration into many domains, with Large Language Models (LLMs) in particular significantly enhancing capabilities in Artificial Intelligence Generated Content (AIGC). However, the complexity of AI systems has also exposed their vulnerabilities, necessitating robust methods for failure analysis (FA) and fault injection (FI) to ensure resilience and reliability. Despite the importance of these techniques, a comprehensive review of FA and FI methodologies in AI systems is lacking. This study fills that gap by presenting a detailed survey of existing FA and FI approaches across six layers of AI systems. We systematically analyze 160 papers and repositories to answer three research questions: (1) what are the prevalent failures in AI systems, (2) what types of faults can current FI tools simulate, and (3) what gaps exist between the simulated faults and real-world failures. Our findings reveal a taxonomy of AI system failures, assess the capabilities of existing FI tools, and highlight discrepancies between real-world and simulated failures. Moreover, this survey contributes to the field by providing a framework for fault diagnosis, evaluating the state of the art in FI, and identifying areas for improvement in FI techniques to enhance the resilience of AI systems.

