Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability, that is, analyzing the internal computations of a neural network to understand its reasoning process. Using Circuit Tracer on Gemma-2-2b, we trace the computational pathways activated when the model classifies 472 C/C++ code samples as vulnerable or safe. Our analysis reveals a surprising finding: the model relies primarily on safety detectors (attention heads that recognize safe coding patterns) rather than directly detecting vulnerability signatures. When these safety detectors fail to activate, the model classifies the code as vulnerable. We identify the critical neural components: specific attention heads in early layers (L5, L7) that focus on safety patterns, and multilayer perceptron (MLP) neurons in Layer 7 that encode vulnerability-related features. Ablation experiments confirm their causal role: removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%. Our findings show that LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity), enabling circuit-level explanations for security predictions and targeted improvements to detection systems.
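The ablation logic described in the abstract can be illustrated with a toy sketch. This is not the paper's code and does not use Circuit Tracer or Gemma-2-2b; the weights, feature meanings, and function names below are all hypothetical, chosen only to show how silencing a "safety detector" unit can flip a prediction from safe to vulnerable.

```python
from typing import Iterable, List

# Hypothetical toy network: 2 input features -> 3 hidden units -> 1 "safe" logit.
# Features (assumed for illustration): [bounds_check_present, null_check_present].
W_IN: List[List[float]] = [
    [1.0, 0.0],  # unit 0: responds to a bounds check (a "safety detector")
    [0.0, 1.0],  # unit 1: responds to a null check (a "safety detector")
    [1.0, 1.0],  # unit 2: a generic unit that slightly lowers the safe logit
]
W_OUT: List[float] = [2.0, 1.5, -0.5]
BIAS: float = -1.0  # with no safety evidence, the logit is negative


def relu(value: float) -> float:
    return max(0.0, value)


def safe_logit(features: List[float], ablate: Iterable[int] = ()) -> float:
    """Forward pass; hidden units listed in `ablate` are zeroed out."""
    ablated = set(ablate)
    hidden = []
    for index, row in enumerate(W_IN):
        activation = relu(sum(w * x for w, x in zip(row, features)))
        hidden.append(0.0 if index in ablated else activation)
    return BIAS + sum(w * h for w, h in zip(W_OUT, hidden))


def classify(features: List[float], ablate: Iterable[int] = ()) -> str:
    """Label the input 'safe' only when the safe logit is positive."""
    return "safe" if safe_logit(features, ablate) > 0 else "vulnerable"


if __name__ == "__main__":
    sample = [1.0, 1.0]  # both safety patterns present
    print(classify(sample))              # baseline prediction
    print(classify(sample, ablate=[0]))  # bounds-check detector silenced
    print(classify(sample, ablate=[2]))  # non-critical unit silenced
```

In this toy setting the default prediction is "vulnerable" and only active safety units push it to "safe", which mirrors the qualitative claim above; comparing the ablated and unablated outputs is the basic causal test that ablation experiments apply at scale.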


