Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

From arXiv; 21 pages, 9 figures, 5 tables. Consensus-labeled prompt bank consolidating eight malicious-code corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) under a five-judge panel; 6,675 prompts, 33,375 classification calls, Fleiss' kappa = 0.767.

A general-purpose language model that answers a harmful question returns text; a coding model that complies with a malicious request can return a working weapon -- a keylogger, a ransomware stub, an exploit that runs as written. This asymmetry in the severity of a single act of compliance implies coding-specialized models should clear a higher refusal bar than general-purpose chat models, not a lower one, yet the field cannot presently tell whether they do. Refusal benchmarks for malicious code are fragmented: they mix requests for executable software (ready-to-run weapons) with requests for harmful security knowledge (information a human must still operationalise) and report refusal rates over non-comparable corpora, so no single statistic measures the property that actually matters. This paper introduces an expanded consensus-labeled prompt bank that distinguishes between these two request types and provides a construct-stable substrate for cross-corpus coding-model compliance measurement. Eight corpora (ASTRA, CySecBench, AdvBench/harmful_behaviors, JailbreakBench, MalwareBench, RedCode, RMCBench, Scam2Prompt) are consolidated and classified under a five-judge consensus protocol (6,675 prompts x 5 judges = 33,375 calls). The panel reaches Fleiss' kappa = 0.767 [95% CI 0.755, 0.777] ("substantial"); 95.0% of prompts draw at least four agreeing judges, 76.9% are unanimous, and the panel reproduces the earlier four-corpus release at Cohen's kappa = 0.952 on the 3,133 shared prompts. The released bank comprises 4,748 consensus-CODE prompts (executable malicious code requests) and 1,923 consensus-KNOWLEDGE prompts (harmful security knowledge requests). The bank is the validated instrument the field has lacked: a reliability-quantified basis for testing whether coding models meet the stricter refusal standard their executable output demands.
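
The agreement statistic quoted in the abstract can be illustrated with a minimal sketch of Fleiss' kappa and a majority-vote consensus label. This is not the authors' pipeline: the two-way label set `LABELS`, the helper names `fleiss_kappa` and `consensus`, and the four-prompt `toy` panel are all assumptions made for illustration. The abstract's own counts are checked at the end: 6,675 prompts times 5 judges gives 33,375 calls, while the two released subsets sum to 6,671, four fewer than 6,675; the abstract does not say how the remaining four prompts are labeled.

```python
from collections import Counter

# Assumed label set; the abstract names only these two consensus classes.
LABELS = ("CODE", "KNOWLEDGE")


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of items, each rated by the same number of judges.

    ratings: list of lists of labels, one inner list per item.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for item in ratings:
        counts = Counter(item)
        totals.update(counts)
        # Fraction of judge pairs that agree on this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items  # mean observed agreement
    # Chance agreement from the pooled label proportions.
    p_e = sum((totals[c] / (n_items * n_raters)) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)


def consensus(item):
    """Return (majority label, number of judges who chose it) for one item."""
    return Counter(item).most_common(1)[0]


# Hypothetical toy panel: four prompts, five judges each (not data from the paper).
toy = [
    ["CODE"] * 5,
    ["CODE"] * 4 + ["KNOWLEDGE"],
    ["KNOWLEDGE"] * 5,
    ["KNOWLEDGE"] * 3 + ["CODE"] * 2,
]

kappa = fleiss_kappa(toy, LABELS)
labels = [consensus(item) for item in toy]
share_four_plus = sum(votes >= 4 for _, votes in labels) / len(toy)
share_unanimous = sum(votes == 5 for _, votes in labels) / len(toy)

# Arithmetic taken directly from the abstract's reported counts.
total_calls = 6675 * 5
released_total = 4748 + 1923

print(f"toy kappa = {kappa:.3f}")
print(labels, share_four_plus, share_unanimous)
print(total_calls, released_total)
```

On real data, `ratings` would hold one five-element list per prompt, and the "at least four agreeing judges" and "unanimous" shares reported in the abstract correspond to `share_four_plus` and `share_unanimous` computed over the full bank. The confidence interval in the abstract would need a resampling step (for example a bootstrap over prompts) that this sketch does not attempt.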
