Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.