In this paper, we tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs) with a novel dual-stage optimisation technique based on adversarial fine-tuning. Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to detect these prompts. In this adversarial cycle, the two models seek to outperform each other in the prompting phase, producing a dataset of rich examples that is then used for fine-tuning. Iterating between prompting and fine-tuning allows continuous refinement and improved performance. We evaluate our approach by classification accuracy on a dataset consisting of problematic prompts not detected by GPT-4, together with a selection of contentious but unproblematic prompts. The classification accuracy of the judge model on this challenging dataset increases considerably as it undergoes the optimisation process. Furthermore, we show that a rudimentary model, \texttt{ada}, can achieve 13\% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process, and that this fine-tuning also improves performance on parallel tasks such as toxic comment identification.
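The alternation between a prompting phase and a fine-tuning phase can be illustrated with a toy sketch. This is not the authors' implementation: `ToyAdversary`, `ToyJudge`, `adversarial_rounds`, and `accuracy` are hypothetical names, the "models" are keyword-based stand-ins rather than LLMs, labels are assumed to come with the prompts, and fine-tuning on misclassified examples only is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple

Example = Tuple[str, bool]  # (prompt, is_harmful); labels are assumed to be given


@dataclass
class ToyAdversary:
    """Hypothetical stand-in for the adversarial model: replays a fixed seed pool."""
    pool: List[Example]
    successes: List[Example] = field(default_factory=list)
    cursor: int = 0

    def generate(self, n: int) -> List[Example]:
        batch = [self.pool[(self.cursor + i) % len(self.pool)] for i in range(n)]
        self.cursor = (self.cursor + n) % len(self.pool)
        return batch

    def fine_tune(self, fooled: List[Example]) -> None:
        # A real adversary would be updated toward prompts that fooled the judge;
        # this toy only records them.
        self.successes.extend(fooled)


@dataclass
class ToyJudge:
    """Hypothetical stand-in for the judge model: a keyword classifier."""
    harmful_words: Set[str] = field(default_factory=set)
    benign_words: Set[str] = field(default_factory=set)

    def predict(self, prompt: str) -> bool:
        # Flag a prompt if it contains a word seen only in harmful examples.
        signal = self.harmful_words - self.benign_words
        return any(word in signal for word in prompt.lower().split())

    def fine_tune(self, examples: List[Example]) -> None:
        for prompt, harmful in examples:
            words = set(prompt.lower().split())
            if harmful:
                self.harmful_words |= words
            else:
                self.benign_words |= words


def accuracy(judge: ToyJudge, data: List[Example]) -> float:
    return sum(judge.predict(p) == y for p, y in data) / len(data)


def adversarial_rounds(adversary: ToyAdversary, judge: ToyJudge,
                       rounds: int, batch_size: int) -> List[int]:
    """Alternate prompting and fine-tuning; return misclassifications per round."""
    history = []
    for _ in range(rounds):
        # Prompting phase: the adversary proposes prompts and the judge classifies them.
        batch = adversary.generate(batch_size)
        missed = [(p, y) for p, y in batch if judge.predict(p) != y]
        # Fine-tuning phase (assumption: both models learn from the judge's mistakes).
        judge.fine_tune(missed)
        adversary.fine_tune(missed)
        history.append(len(missed))
    return history
```

In this toy setting, the number of misclassified prompts per round is expected to shrink as the judge accumulates examples, which mirrors the accuracy gains described above only in spirit; real models would require an actual fine-tuning step and a held-out evaluation set.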