Large language models routinely generate code with exploitable security flaws. Prior work attributes this limitation to a lack of security knowledge, which has steered current defenses toward heavy fine-tuning or external knowledge retrieval, approaches that introduce significant computational overhead and data bias through redundant code examples. Contrary to this view, we argue that pretraining corpora are already rich in security material. The bottleneck is activation: without a brief, explicit cue, statistical pressure toward common training-distribution patterns suppresses the model's security-relevant representations. We present SPARK, an inference-time security harness that activates this latent knowledge without any retraining. The harness has two parts. Component~I retrieves a few relevant Common Weakness Enumeration (CWE) entries for each coding task and appends a short structured cue to the prompt; this alone is enough to surface the model's existing security representations. Component~II adds a precomputed token bias to the logits at every decoding step. We obtain the bias by projecting a safe-direction vector, defined as the unit-normalized difference between the mean safe and mean unsafe last-layer hidden states, through the language model head. The bias is computed once offline; applying it costs a single vector addition per generated token. We evaluate SPARK on 9 open-source models across C++, Java, and Python, and compare it with 7 baselines spanning fine-tuning and retrieval-augmented methods. SPARK matches or improves on the best baseline in every setting while preserving HumanEval utility. We further test Component~I in a black-box setting on 7 of today's strongest models, including Claude, DeepSeek, and GPT, which demonstrates both the activation bottleneck behind insecure code generation and the improvements our method enables.
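The construction of the Component~II bias can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`safe_direction`, `token_bias`, `apply_bias`) and the scaling factor `alpha` are hypothetical, hidden states are represented as plain Python lists, and the language model head is assumed to be a bias-free `vocab_size x hidden_dim` weight matrix. The sketch only mirrors the three steps stated above: take the unit-normalized difference of the mean safe and mean unsafe hidden states, project it through the head once offline, and add the resulting vector to the logits at each decoding step.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safe_direction(safe_states, unsafe_states):
    """Unit-normalized difference between mean safe and mean unsafe hidden states."""
    mu_safe = mean_vector(safe_states)
    mu_unsafe = mean_vector(unsafe_states)
    diff = [s - u for s, u in zip(mu_safe, mu_unsafe)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("mean safe and mean unsafe states coincide")
    return [d / norm for d in diff]


def token_bias(lm_head, direction, alpha=1.0):
    """Project the direction through the LM head: one bias value per vocabulary token.

    lm_head is assumed to be a vocab_size x hidden_dim matrix (list of rows).
    alpha is a hypothetical strength parameter, not taken from the source.
    """
    return [alpha * sum(w * d for w, d in zip(row, direction)) for row in lm_head]


def apply_bias(logits, bias):
    """Per-step cost: a single vector addition over the vocabulary."""
    return [l + b for l, b in zip(logits, bias)]


if __name__ == "__main__":
    # Toy data: hidden_dim = 2, vocab_size = 3.
    safe = [[1.0, 0.0], [3.0, 0.0]]
    unsafe = [[0.0, 4.0], [-2.0, 4.0]]
    head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

    direction = safe_direction(safe, unsafe)   # computed once, offline
    bias = token_bias(head, direction)         # computed once, offline
    step_logits = [0.0, 0.0, 0.0]              # stand-in for one decoding step
    print(apply_bias(step_logits, bias))
```

In this toy setting the bias raises the logit of the token whose head row aligns with the safe direction and lowers the one aligned against it; a real deployment would take the hidden states and head weights from the model itself.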