Hardware faults, particularly bit-flips in quantized weights, pose a severe reliability threat to Large Language Models (LLMs), often triggering catastrophic model collapse. We demonstrate that this vulnerability stems fundamentally from the spatial alignment between sensitive weight bits and extreme activation outliers, which allows a single hardware fault to be massively amplified. To address this, we propose Rotated Robustness (RoR), a training-free defense built on orthogonal Householder transformations. By applying an orthogonal rotation to the activation space, RoR geometrically smooths extreme outliers across all feature dimensions. This mechanism breaks the alignment between outliers and vulnerable weight bits while mathematically guaranteeing that the original model accuracy is preserved. Extensive empirical evaluations across the Llama-2/3, OPT, and Qwen families demonstrate the superior reliability of our approach. Under random bit-flip attacks, RoR reduces the stochastic collapse rate on Qwen2.5-7B from 3.15\% to 0.00\%. Under severe targeted attacks with 50 Progressive Bit Search flips, RoR sustains robust reasoning on Llama-2-7B, maintaining 43.9\% MMLU accuracy, nearly matching its 45.2\% unattacked accuracy, while competing defenses collapse to random guessing. Most notably, against the Single-Point Fault Attack (SPFA), the most aggressive targeted threat, RoR inflates the required attack effort from a few bit-flips to over 17,000 precise bit-flips. With a negligible storage overhead of 0.31\% and a minimal inference latency increase of 9.1\% on Llama-2-7B, RoR achieves truly lossless robustness, providing a practical and highly reliable defense for LLM deployment.
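The core mechanism can be illustrated with a minimal toy sketch (not the paper's implementation): a Householder reflection $H = I - 2vv^\top$ is orthogonal, so rotating activations by $H$ and compensating the weights by $H^\top$ leaves the layer output mathematically unchanged, while an extreme outlier channel is spread across all dimensions. The dimension, matrices, and the choice of $v$ below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Toy activation vector with one extreme outlier channel (hypothetical).
x = rng.normal(size=d)
x[3] = 100.0  # extreme activation outlier

W = rng.normal(size=(d, d))  # toy weight matrix

# Householder reflection H = I - 2 v v^T, with v chosen so that H maps the
# outlier's basis direction e_3 onto the uniform direction u = 1/sqrt(d),
# spreading the outlier's energy across all feature dimensions.
e = np.zeros(d)
e[3] = 1.0
u = np.ones(d) / np.sqrt(d)
v = e - u
v /= np.linalg.norm(v)
H = np.eye(d) - 2.0 * np.outer(v, v)

x_rot = H @ x    # rotated activations: outlier smoothed
W_rot = W @ H.T  # compensated weights (H is orthogonal: H.T @ H = I)

# The layer output is exactly preserved (up to floating-point error),
# while the peak activation magnitude drops sharply.
assert np.allclose(W @ x, W_rot @ x_rot)
print(np.abs(x).max(), np.abs(x_rot).max())
```

Because $W_\text{rot} x_\text{rot} = W H^\top H x = W x$, accuracy on clean inputs is untouched; the robustness gain comes from the fact that after rotation no single weight bit multiplies an extreme activation, so a flipped bit's error is no longer amplified.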