Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability, that is, analyzing the internal computations of a neural network to understand its reasoning process. Using Circuit Tracer on Gemma-2-2b, we trace the computational pathways activated when the model classifies 472 C/C++ code samples as vulnerable or safe. Our analysis reveals a surprising finding: the model primarily relies on safety detectors, i.e., attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures. When these safety detectors fail to activate, the model classifies code as vulnerable. We identify the critical neural components: specific attention heads in early layers (L5, L7) that focus on safety patterns, and multilayer perceptron (MLP) neurons in Layer 7 that encode vulnerability-related features. Ablation experiments confirm their causal role: removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%. Our findings show that LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity), enabling circuit-level explanations for security predictions and targeted improvements to detection systems.
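To make the ablation logic concrete, the following is a minimal toy sketch, not the experimental code used on Gemma-2-2b and not the Circuit Tracer API. It assumes a hypothetical two-neuron hidden layer in which neuron 0 acts as a "safety detector" that suppresses the vulnerable logit, and neuron 1 supplies a constant push toward "vulnerable". All names (`mlp_forward`, `accuracy`, `W_IN`, `W_OUT`) and all weights are illustrative assumptions. Zero-ablating the safety detector causes safe samples to be classified as vulnerable, which mirrors the default-to-vulnerable behaviour described above.

```python
def mlp_forward(x, w_in, w_out, ablate=()):
    """Toy one-hidden-layer ReLU network returning a 'vulnerable' logit.

    `ablate` lists hidden-neuron indices whose activations are set to zero
    (zero-ablation). Purely illustrative; not a real model.
    """
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
    for idx in ablate:
        hidden[idx] = 0.0
    return sum(w * h for w, h in zip(w_out, hidden))


def accuracy(samples, labels, w_in, w_out, ablate=()):
    """Fraction of samples whose predicted label (logit > 0) matches `labels`."""
    correct = sum(
        (mlp_forward(x, w_in, w_out, ablate) > 0.0) == y
        for x, y in zip(samples, labels)
    )
    return correct / len(samples)


# Hypothetical weights. Input is [safe_pattern_present, constant_1].
W_IN = [[1.0, 0.0],   # neuron 0: fires when a safe coding pattern is present
        [0.0, 1.0]]   # neuron 1: always-on bias toward "vulnerable"
W_OUT = [-2.0, 1.0]   # the safety detector suppresses the vulnerable logit

SAMPLES = [[1.0, 1.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
LABELS = [False, True, False, True]  # True means vulnerable

baseline = accuracy(SAMPLES, LABELS, W_IN, W_OUT)
no_safety = accuracy(SAMPLES, LABELS, W_IN, W_OUT, ablate=(0,))
print(f"baseline accuracy: {baseline:.2f}")
print(f"safety detector ablated: {no_safety:.2f}")
```

In this toy setting, comparing accuracy with and without the ablated neuron is the same kind of causal test as the layer-level and neuron-level ablations reported above, only at a much smaller scale.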