SecureBreak -- A dataset towards safe and secure models

Large language models (LLMs) are becoming pervasive core components of many real-world applications, which makes security alignment a critical requirement for their safe deployment. Although previous work has focused primarily on model architectures and alignment methodologies, these approaches alone cannot guarantee the complete elimination of harmful generations. This concern is reinforced by a growing body of literature showing that attacks such as jailbreaking and prompt injection can bypass existing security alignment mechanisms. Additional security strategies are therefore needed, both to provide qualitative feedback on the robustness of the security alignment obtained at the training stage and to create an "ultimate" defense layer that blocks unsafe outputs produced by deployed models. As a contribution in this direction, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is manually annotated, and labels are assigned conservatively to favor safety, which makes it highly reliable. It proves effective for detecting unsafe content across multiple risk categories, and tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further model alignment and security improvements.

