Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

arXiv preprint; 23 pages, 9 figures, 6 tables. A consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) that span diverse elicitation paradigms; 6,675 prompts, 33,375 classification calls.

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon: a keylogger, ransomware, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies that coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software with requests for harmful security knowledge, and they report refusal rates over non-comparable corpora. This paper's central result is that the CODE-versus-KNOWLEDGE classification axis established in a prior four-corpus release remains stable under a substantially expanded corpus pool and an independently refreshed judge panel, which is evidence that it measures a real construct rather than an artifact of the prompts or judges. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) spanning diverse elicitation paradigms (direct, jailbreak-decorated, indirect, and agent/interpreter) are classified under a five-judge consensus protocol (6,675 prompts × 5 judges = 33,375 calls), reaching Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial" agreement). Critically, the panel shares no judge with the prior release (five paid commercial APIs were replaced by five open-weight models from five vendors), yet the two panels agree on 94.45% of the 3,133 shared prompts and reach Cohen's kappa = 0.952 [0.942, 0.963] on the 3,031-prompt binary overlap: the axis survives near-total panel replacement. The released bank comprises 4,748 consensus-CODE and 1,923 consensus-KNOWLEDGE prompts, forming a reliability-quantified benchmark whose central classification axis is shown to be stable across corpus expansion and judge-panel replacement.
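
The agreement statistics quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The abstract does not state the consensus rule, so the majority threshold (at least 3 of 5 judges), the two-label set, the function names, and the toy votes below are all assumptions made for illustration. The formulas are the standard definitions of Fleiss' kappa (many raters) and Cohen's kappa (two raters).

```python
from collections import Counter

# Assumed label set; the paper's full label scheme may contain more categories.
LABELS = ["CODE", "KNOWLEDGE"]


def consensus_label(votes, min_votes=3):
    """Return the label backed by at least `min_votes` judges, else None.

    The 3-of-5 majority threshold is an assumption, not the paper's stated rule.
    """
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_votes else None


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of per-prompt vote lists (same judge count each)."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of judges assigning prompt i to category j
    counts = [[Counter(votes)[c] for c in categories] for votes in ratings]
    # Per-prompt agreement: fraction of judge pairs that agree
    p_items = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_items) / n_items
    # Chance agreement from the overall category proportions
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def cohen_kappa(a, b):
    """Cohen's kappa between two label sequences over the same prompts."""
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_exp = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


if __name__ == "__main__":
    # Toy votes from five hypothetical judges on four hypothetical prompts
    toy = [
        ["CODE"] * 5,
        ["CODE", "CODE", "CODE", "CODE", "KNOWLEDGE"],
        ["KNOWLEDGE"] * 5,
        ["CODE", "CODE", "KNOWLEDGE", "KNOWLEDGE", "KNOWLEDGE"],
    ]
    print([consensus_label(v) for v in toy])
    print(round(fleiss_kappa(toy, LABELS), 3))
    # Total classification calls reported in the abstract
    print(6675 * 5)
```

In this sketch, cross-panel stability would be checked by computing `consensus_label` per prompt for each panel and passing the two resulting label sequences to `cohen_kappa`, restricted to prompts where both panels produced one of the two binary labels.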
