In recent years, large language models (LLMs) have achieved remarkable advances and are increasingly deployed in critical applications across diverse domains. This growing adoption raises urgent concerns about their security and robustness. In this work, we investigate the impact on LLMs of Bit Flip Attacks (BFAs), which exploit hardware faults to corrupt model parameters, thereby threatening model integrity and performance. Existing BFA studies primarily assume a white-box setting with access to the exact model weights and part of the dataset, and rely on progressive gradient-based bit-search strategies to identify vulnerable bits in the model weights. However, gradient computation for LLMs is computationally expensive and memory-intensive. Moreover, the assumption of access to the exact victim model weights and datasets is increasingly unrealistic under ever-stricter user privacy regulations. To address these challenges, we propose the first gray-box BFA framework for LLMs, Invisible Hands, designed for efficient and practical deployment. Our method, Gradient-Data-Free-BFA, identifies vulnerable weight bits without requiring knowledge of the model weights, gradients, or sample data. It introduces novel vulnerability index metrics that estimate the susceptibility of weights based solely on the model architecture (gray-box). By eliminating data access and gradient computation, our approach significantly reduces memory overhead and scales efficiently across tasks with constant complexity. Experiments on six open-source LLMs demonstrate that adversarial objectives can be achieved with minimal weight perturbations, highlighting the effectiveness and practicality of Invisible Hands.
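To illustrate why a single hardware-induced bit flip threatens model integrity, consider a weight stored in IEEE 754 half precision (float16), a common format for LLM inference. The sketch below is illustrative only and is not the paper's attack: it assumes float16 storage and flips the most significant exponent bit (bit 14), which inflates a small weight by several orders of magnitude; the `flip_bit` helper is hypothetical.

```python
import numpy as np

def flip_bit(value: np.float16, bit: int) -> np.float16:
    """Flip one bit (0 = LSB, 15 = sign) of a float16 value's raw encoding."""
    raw = value.view(np.uint16)                     # reinterpret the bits as uint16
    corrupted = np.uint16(int(raw) ^ (1 << bit))    # XOR toggles the chosen bit
    return corrupted.view(np.float16)               # reinterpret back as float16

w = np.float16(0.0123)       # a typically small trained weight
w_bad = flip_bit(w, 14)      # flip the top exponent bit: 0.0123 becomes ~8e2
```

A flip in a high exponent bit is the worst case; flips in low mantissa bits perturb the value only slightly, which is why BFA methods search for the few bits whose corruption maximally degrades the model.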