Large language models (LLMs) are vulnerable when trained on datasets containing harmful content, which enables jailbreaking attacks in two scenarios: harmful texts mixed into the crowdsourced data used for pre-training, and direct tampering with LLMs through fine-tuning. In both scenarios, adversaries can compromise the safety alignment of LLMs and exacerbate model malfunctions. To mitigate these adversarial influences, our research aims to strengthen safety alignment by either neutralizing the impact of malicious texts in pre-training datasets or making jailbreaking harder during downstream fine-tuning. In this paper, we propose a data curation framework that counters adversarial impacts in both scenarios. Our method assumes no prior knowledge of attack details and focuses solely on curating clean texts. We introduce an iterative process that revises texts to reduce their perplexity as perceived by LLMs while preserving their text quality. By pre-training or fine-tuning LLMs with the curated clean texts, we observe a notable improvement in the robustness of LLM safety alignment against harmful queries. For instance, when pre-training LLMs on a crowdsourced dataset containing 5% harmful instances, adding an equivalent amount of curated texts significantly lowers the likelihood that LLMs provide harmful responses and reduces the attack success rate by 71%. Our study represents a significant step towards mitigating the risks of training-based jailbreaking and securing the use of LLMs.
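To make the iterative, perplexity-reducing revision concrete, the following is a minimal sketch and not the paper's implementation. It assumes a toy add-one-smoothed unigram model as a stand-in for the LLM that scores perplexity, and the names `make_unigram_logprob`, `perplexity`, `curate`, `propose`, and `quality_ok` are all hypothetical. The abstract does not specify how revisions are generated or how text quality is judged, so both are left as caller-supplied functions.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List


def make_unigram_logprob(corpus: List[str]) -> Callable[[str], float]:
    """Toy stand-in for an LLM scorer (assumption): add-one-smoothed
    unigram log-probabilities estimated from a small reference corpus."""
    counts = Counter(w for line in corpus for w in line.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra bucket for unseen words

    def logprob(token: str) -> float:
        return math.log((counts[token.lower()] + 1) / (total + vocab))

    return logprob


def perplexity(text: str, logprob: Callable[[str], float]) -> float:
    """exp of the mean negative log-probability over whitespace tokens."""
    tokens = text.split()
    if not tokens:
        return float("inf")
    return math.exp(-sum(logprob(t) for t in tokens) / len(tokens))


def curate(
    text: str,
    propose: Callable[[str], Iterable[str]],   # hypothetical revision generator
    quality_ok: Callable[[str, str], bool],    # hypothetical quality check
    logprob: Callable[[str], float],
    max_rounds: int = 5,
) -> str:
    """Greedy loop: each round keeps the lowest-perplexity revision that
    passes the quality check, and stops once perplexity no longer drops."""
    best, best_ppl = text, perplexity(text, logprob)
    for _ in range(max_rounds):
        # The quality check always compares against the original text.
        candidates = [c for c in propose(best) if quality_ok(text, c)]
        if not candidates:
            break
        cand = min(candidates, key=lambda c: perplexity(c, logprob))
        cand_ppl = perplexity(cand, logprob)
        if cand_ppl >= best_ppl:
            break
        best, best_ppl = cand, cand_ppl
    return best
```

In practice the scorer would be the target LLM's token log-likelihoods rather than a unigram table, and `propose` would typically be a model-generated rewrite. The greedy accept-if-lower rule is only one possible search strategy.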