Large language models (LLMs) are becoming core components of many real-world applications, which makes security alignment a critical requirement for their safe deployment. Previous work has focused primarily on model architectures and alignment methodologies, but these approaches alone cannot guarantee that harmful generations are completely eliminated. A growing body of literature reinforces this concern by showing that attacks such as jailbreaking and prompt injection can bypass existing security alignment mechanisms. Additional security strategies are therefore needed, both to provide qualitative feedback on the robustness of the security alignment obtained at the training stage and to create an "ultimate" defense layer that blocks unsafe outputs produced by deployed models. To this end, this paper introduces SecureBreak, a safety-oriented dataset designed to support the development of AI-driven solutions for detecting harmful LLM outputs caused by residual weaknesses in security alignment. The dataset is manually annotated, with labels assigned conservatively to favor safety, which makes it highly reliable. It proves effective for detecting unsafe content across multiple risk categories, and tests with pre-trained LLMs show improved results after fine-tuning on SecureBreak. Overall, the dataset is useful both for post-generation safety filtering and for guiding further improvements in model alignment and security.