TRIDENT: Enhancing Large Language Model Safety with Tri-Dimensional Diversified Red-Teaming Data Synthesis

Large Language Models (LLMs) excel in various natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework to systematically measure the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, with 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge yields substantial improvements: an average 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to the best-performing baseline model, which was fine-tuned on the WildBreak dataset.


