The demand for efficient large language model (LLM) inference has propelled the development of dedicated accelerators. Because accelerators are vulnerable to hardware faults caused by aging, process variation, and other effects, existing accelerator designs often reserve a large voltage margin or leverage algorithm-based fault tolerance (ABFT) techniques to ensure LLM inference correctness. However, previous methods often overlook the inherent fault tolerance of LLMs, leading to high computation and energy overhead. To enable reliable yet efficient LLM inference, in this paper we propose a novel algorithm/circuit co-design framework, dubbed ReaLM. For the first time, we systematically characterize the fault tolerance of LLMs by performing a large-scale error injection study of representative LLMs and natural language understanding tasks. We then propose a statistical ABFT algorithm that fully leverages this inherent robustness to minimize error recovery overhead. We also customize the error detection circuits to enable low-cost online collection of error statistics. Extensive experiments show that with only 1.42% circuit area and 1.79% power overhead, ReaLM reduces perplexity degradation from 18.54 to 0.29. Compared to existing methods, ReaLM consistently reduces recovery costs across different operating voltages and improves energy efficiency by up to 35.83% without compromising LLM performance. Our error injection code is available at https://github.com/2000012835xt/ReaLM-DAC.
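As background for readers unfamiliar with ABFT, the classic checksum-based scheme (which the statistical ABFT algorithm above builds on) augments a matrix multiplication with row and column checksums so that compute faults can be detected from the result alone. The following is a minimal NumPy sketch of this standard technique, not the paper's statistical variant; the function name `abft_matmul` is our own illustration.

```python
import numpy as np

def abft_matmul(A, B):
    """Checksum-protected matrix multiply (classic ABFT sketch).

    Appends a column-sum row to A and a row-sum column to B, so the
    product carries checksums that must match the data block's sums.
    Returns the data block and a flag indicating whether the
    checksums verified (False suggests a compute fault).
    """
    Ac = np.vstack([A, A.sum(axis=0)])                  # column checksum row
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row checksum column
    C = Ac @ Br                                         # protected multiply
    data = C[:-1, :-1]                                  # the actual product
    # Verify: last row/column of C must equal the data block's sums
    row_ok = np.allclose(C[-1, :-1], data.sum(axis=0))
    col_ok = np.allclose(C[:-1, -1], data.sum(axis=1))
    return data, (row_ok and col_ok)
```

A mismatch in only the row checksum localizes the fault to a column (and vice versa), which is what lets ABFT detect, and in some schemes correct, single errors without recomputing the whole product.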