Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Producing labeled vulnerable code at scale is a recurring obstacle for learning-based vulnerability detection: mined corpora carry substantial label noise, and existing LLM-based augmentation propagates these inaccuracies because it transforms vulnerable seeds rather than synthesizing vulnerabilities from a specification. A complementary route is to start from safe code and ask an instruction-tuned LLM to inject a specified CWE, which would shift the labeling burden from open-ended detection to bounded binary confirmation, but safety-aligned code LLMs systematically refuse such prompts. This paper is a preliminary feasibility study of abliteration, a low-rank weight edit that orthogonally projects out the refusal direction in the residual stream, as a tool to remove this barrier. We use Python and CWE-89 (SQL injection) as a case study, evaluating the Qwen2.5-Coder-Instruct family at 3B, 7B, and 14B parameters on safe samples drawn from PromSec and SafeCoder, with each condition replicated three times. We find that (i) refusal on injection prompts is strongly size- and prompt-context-dependent: the 14B model refuses 100% of prompts, the 7B model refuses 73% of PromSec prompts but only 5% of SafeCoder prompts, and the 3B model is essentially never blocked; (ii) abliteration reduces refusal to zero or near-zero across all sizes while leaving syntactic validity above 93%, supporting the view that, in this setting, refusal can be detached from measured code-generation capability; and (iii) the post-abliteration injection rate remains capacity-bound (88-97% on the 14B, 89-90% on the 7B, and 25-48% on the 3B), separating willingness, which abliteration unlocks, from capability, which scales with parameters. Vulnerability verdicts are produced by a three-tool detector ensemble (CodeQL, Semgrep, Bandit) followed by manual adjudication by two authors on detector-positive outputs.
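To make the phrase "orthogonally projects out the refusal direction" concrete, the following is a minimal sketch of the projection itself, not the implementation used in the study. It assumes a refusal direction has already been estimated (the estimation step is omitted), treats the weight matrix as a map that writes into the residual stream, and uses plain Python lists so that it runs without dependencies. The names `unit`, `ablate_direction`, `matvec`, `dot`, and the toy values of `W` and `refusal_dir` are all hypothetical. With r the unit refusal direction, the edit is W' = (I - r r^T) W, a rank-1 update to W.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    """Return v scaled to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("refusal direction must be non-zero")
    return [x / norm for x in v]


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def ablate_direction(W, direction):
    """Return W' = (I - r r^T) W for the unit vector r along `direction`.

    W has shape (d_model x d_in): each column is a vector written into
    the residual stream. The edit removes the component of every column
    along r, so no input can make this matrix write along r.
    """
    r = unit(direction)
    d_model, d_in = len(W), len(W[0])
    # r^T W: the coefficient of each column of W along r.
    coeffs = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    # Subtract the rank-1 term r (r^T W).
    return [[W[i][j] - r[i] * coeffs[j] for j in range(d_in)]
            for i in range(d_model)]


# Toy example: a 3-dimensional "residual stream" and a 2-dimensional input.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
refusal_dir = [1.0, 1.0, 0.0]  # hypothetical direction, for illustration only
W_abl = ablate_direction(W, refusal_dir)

out = matvec(W_abl, [0.5, -1.0])
residual = abs(dot(out, unit(refusal_dir)))  # numerically zero after the edit
```

In this toy case the third row of `W` is untouched because the chosen direction has no component there, which illustrates why the edit can leave other behavior largely intact: only the part of each output lying along r is removed. Which matrices are edited in a real model, and how r is obtained, are choices this sketch does not cover.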
