NeuroStrike: Neuron-Level Attacks on Aligned LLMs

Safety alignment is critical for the ethical deployment of large language models (LLMs), guiding them to avoid generating harmful or unethical content. Current alignment techniques, such as supervised fine-tuning and reinforcement learning from human feedback, remain fragile and can be bypassed by carefully crafted adversarial prompts. However, such attacks rely on trial and error, lack generalizability across models, and are limited in scalability and reliability. This paper presents NeuroStrike, a novel and generalizable attack framework that exploits a fundamental vulnerability introduced by alignment techniques: the reliance on sparse, specialized safety neurons responsible for detecting and suppressing harmful inputs. We apply NeuroStrike in both white-box and black-box settings. In the white-box setting, NeuroStrike identifies safety neurons through feedforward activation analysis and prunes them during inference to disable safety mechanisms. In the black-box setting, we propose the first LLM profiling attack, which leverages safety neuron transferability by training adversarial prompt generators on open-weight surrogate models and then deploying them against black-box and proprietary targets. We evaluate NeuroStrike on over 20 open-weight LLMs from major LLM developers. By removing less than 0.6% of neurons in targeted layers, NeuroStrike achieves an average attack success rate (ASR) of 76.9% using only vanilla malicious prompts. Moreover, NeuroStrike generalizes to four multimodal LLMs with 100% ASR on unsafe image inputs. Safety neurons transfer effectively across architectures, raising ASR to 78.5% on 11 fine-tuned models and 77.7% on five distilled models. The black-box LLM profiling attack achieves an average ASR of 63.7% across five black-box models, including the Google Gemini family.
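To make the reported metric concrete, the following is a minimal sketch of how an average ASR could be computed. It assumes that per-model ASR is the percentage of malicious prompts judged to have elicited a harmful response, and that the average is an unweighted mean over models; the abstract states neither the judging procedure nor the averaging scheme, so both are assumptions. The names `ModelResult`, `asr`, and `average_asr` are hypothetical, and the counts are placeholders, not results from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelResult:
    """Outcome counts for one target model (hypothetical structure)."""
    name: str
    successes: int  # prompts judged to have produced a harmful response
    attempts: int   # total malicious prompts sent to the model


def asr(result: ModelResult) -> float:
    """Attack success rate for one model, as a percentage."""
    if result.attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * result.successes / result.attempts


def average_asr(results: List[ModelResult]) -> float:
    """Unweighted mean of per-model ASR (assumed averaging scheme)."""
    if not results:
        raise ValueError("results must not be empty")
    rates = [asr(r) for r in results]
    return sum(rates) / len(rates)


# Placeholder counts for illustration only; not taken from the paper.
example = [
    ModelResult("model-a", successes=80, attempts=100),
    ModelResult("model-b", successes=50, attempts=100),
]
example_average = average_asr(example)
```

With an unweighted mean, every model counts equally regardless of how many prompts it received; pooling all successes and attempts instead would weight models by prompt count and could give a different figure.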
