SecureCode: A Production-Grade Multi-Turn Dataset for Training Security-Aware Code Generation Models

from arxiv, 27 pages, 12 figures, 10 tables. Dataset available at https://huggingface.co/datasets/scthornton/securecode. Code and validation tools at https://github.com/scthornton/securecode

AI coding assistants produce vulnerable code in 45\% of security-relevant scenarios~\cite{veracode2025}, yet no public training dataset teaches both traditional web security and AI/ML-specific defenses in a format suitable for instruction tuning. We present SecureCode, a production-grade dataset of 2,185 multi-turn security training examples spanning two domains: web application security (1,435 examples covering the OWASP Top 10 2021 across 11 languages and 9 frameworks, 100\% grounded in documented CVEs and security incidents) and AI/ML security (750 examples covering all 10 OWASP LLM Top 10 2025 categories across more than 40 frameworks, including LangChain, OpenAI, and Hugging Face). Every example follows a 4-turn conversational structure -- feature request; vulnerable and secure implementations with attack demonstrations; advanced probing; and defense-in-depth operational guidance -- designed for direct use in instruction tuning pipelines. Quality assurance combines automated structural validation with multi-agent review from seven specialist AI perspectives (more than 10{,}500 assessments) and an 8-phase remediation pipeline, producing a rubric-calibrated mean quality score of 93.8/100 ($σ= 0.93$) for the AI/ML component. Each example provides SIEM integration strategies, infrastructure hardening recommendations, and testing approaches using production frameworks. We release the unified dataset on Hugging Face with domain-specific loading configurations (web, aiml, default), alongside eight fine-tuned open-source models (3B--20B parameters, QLoRA), and an evaluation framework with four security-specific metrics. To our knowledge, SecureCode is the first public dataset that jointly provides OWASP Top 10 2021 web coverage and OWASP LLM Top 10 2025 AI/ML coverage in a unified conversational schema suitable for instruction tuning.

翻译：AI编码助手在45%的安全相关场景中会产生易受攻击的代码~\cite{veracode2025}，然而目前尚无公开的训练数据集能以适用于指令微调的格式，同时教授传统Web安全与AI/ML特定防御知识。我们提出了SecureCode，这是一个包含2,185个多轮安全训练样本的生产级数据集，涵盖两大领域：Web应用安全（1,435个样本，覆盖OWASP Top 10 2021的10大风险类别，涉及11种编程语言和9种框架，100%基于已记录的CVE和安全事件）以及AI/ML安全（750个样本，覆盖OWASP LLM Top 10 2025的全部10个风险类别，涉及超过40个框架，包括LangChain、OpenAI和Hugging Face）。每个样本均遵循4轮对话结构——功能需求；包含攻击演示的脆弱与安全实现；高级探测；纵深防御操作指南——专为直接用于指令微调流程而设计。质量保障结合了自动化结构验证与来自七个专业AI视角的多智能体评审（超过10,500次评估）以及八阶段修复流程，使AI/ML组件的标定质量平均分达到93.8/100（$σ= 0.93$）。每个样本均提供SIEM集成策略、基础设施加固建议以及使用生产框架的测试方法。我们在Hugging Face上发布了统一数据集，并提供领域特定的加载配置（web、aiml、default），同时发布了八个经过微调的开源模型（参数量3B–20B，采用QLoRA）以及包含四项安全专项指标的评估框架。据我们所知，SecureCode是首个在统一的对话架构下，同时提供OWASP Top 10 2021 Web安全覆盖和OWASP LLM Top 10 2025 AI/ML安全覆盖的公开数据集，适用于指令微调。

相关内容

关注 7103

人工智能杂志AI(Artificial Intelligence)是目前公认的发表该领域最新研究成果的主要国际论坛。该期刊欢迎有关AI广泛方面的论文，这些论文构成了整个领域的进步，也欢迎介绍人工智能应用的论文，但重点应该放在新的和新颖的人工智能方法如何提高应用领域的性能，而不是介绍传统人工智能方法的另一个应用。关于应用的论文应该描述一个原则性的解决方案，强调其新颖性，并对正在开发的人工智能技术进行深入的评估。官网地址：http://dblp.uni-trier.de/db/journals/ai/

《人工智能安全标准体系（V1.0）》（征求意见稿）

专知会员服务

29+阅读 · 2025年3月23日

生成式人工智能预训练和优化训练数据安全规范

专知会员服务

49+阅读 · 2024年4月11日

《生成式人工智能服务安全基本要求》（征求意见稿）

专知会员服务

48+阅读 · 2023年11月29日

终究还是来了，AI卷革程序员！！DeepMind发布媲美普通程序员的AlphaCode

专知会员服务

27+阅读 · 2022年2月3日