Large language models (LLMs) are increasingly integrated into high-performance computing (HPC) workflows, where they accelerate scientific discovery through uses such as code generation and domain-specific decision-making. Yet how soft errors propagate through and affect LLM inference remains largely unexplored. To bridge this gap, we present a comprehensive study of error propagation in LLM inference, enabled by LLMFI, our configurable and deterministic fault-injection framework. Using LLMFI, we systematically inject faults into three open-weight LLMs across thirteen representative tasks covering reasoning, multilingual, mathematical, and coding domains. In addition, we conduct fine-grained case studies that reveal critical vulnerability patterns. Overall, our study yields 17 takeaways that advance the understanding of error propagation in LLM inference and introduces four low-overhead directions for improving reliability through software-only modifications, offering practical guidance for future error detection and mitigation.