Large Language Models (LLMs) excel at a wide range of natural language processing tasks but remain vulnerable to generating harmful content or being exploited for malicious purposes. Although safety alignment datasets have been introduced to mitigate such risks through supervised fine-tuning (SFT), these datasets often lack comprehensive risk coverage. Most existing datasets focus primarily on lexical diversity while neglecting other critical dimensions. To address this limitation, we propose a novel analysis framework that systematically measures the risk coverage of alignment datasets across three essential dimensions: Lexical Diversity, Malicious Intent, and Jailbreak Tactics. We further introduce TRIDENT, an automated pipeline that leverages persona-based, zero-shot LLM generation to produce diverse and comprehensive instructions spanning these dimensions. Each harmful instruction is paired with an ethically aligned response, resulting in two datasets: TRIDENT-Core, comprising 26,311 examples, and TRIDENT-Edge, comprising 18,773 examples. Fine-tuning Llama 3.1-8B on TRIDENT-Edge yields substantial improvements: an average 14.29% reduction in Harm Score and a 20% decrease in Attack Success Rate compared to the best-performing baseline model, which is fine-tuned on the WildBreak dataset.