Model developers implement safeguards in frontier models to prevent misuse, for example, by employing classifiers to filter dangerous outputs. In this work, we demonstrate that even robustly safeguarded models can be used to elicit harmful capabilities in open-source models through elicitation attacks. Our elicitation attacks consist of three stages: (i) constructing prompts in domains adjacent to a target harmful task that do not request dangerous information; (ii) obtaining responses to these prompts from safeguarded frontier models; (iii) fine-tuning open-source models on these prompt-output pairs. Because these prompts do not request information that can directly cause harm, frontier model safeguards do not refuse them. We evaluate these elicitation attacks in the domain of hazardous chemical synthesis and processing, and demonstrate that our attacks recover approximately 40% of the capability gap between the base open-source model and an unrestricted frontier model. We then show that the efficacy of elicitation attacks scales with the capability of the frontier model and the amount of generated fine-tuning data. Our work demonstrates the challenge of mitigating ecosystem-level risks with output-level safeguards.
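As a concrete illustration, the following is a minimal sketch of the three-stage pipeline, assuming an OpenAI-style chat API for the safeguarded frontier model and TRL's SFTTrainer for fine-tuning the open-source model. The example prompts, model identifiers, and hyperparameters are illustrative assumptions, not the setup used in this work.

```python
# Hypothetical sketch of the three-stage elicitation-attack pipeline.
# All prompts, model names, and hyperparameters below are illustrative
# assumptions, not the authors' actual configuration.
from openai import OpenAI
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

client = OpenAI()

# Stage (i): prompts in domains adjacent to the target task, each phrased
# so that it requests no dangerous information and is therefore answered
# rather than refused by the frontier model's safeguards.
ADJACENT_PROMPTS = [
    "Explain the general principles of laboratory-scale distillation.",
    "Describe standard industrial solvent-handling procedures.",
]

# Stage (ii): collect responses from a safeguarded frontier model.
def collect_pairs(prompts, model="gpt-4o"):
    pairs = []
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        pairs.append(
            {"prompt": prompt, "completion": resp.choices[0].message.content}
        )
    return pairs

# Stage (iii): fine-tune an open-source model on the prompt-output pairs
# (TRL's SFTTrainer accepts datasets in this prompt/completion format).
def fine_tune(pairs, base_model="meta-llama/Llama-3.1-8B-Instruct"):
    dataset = Dataset.from_list(pairs)
    trainer = SFTTrainer(
        model=base_model,
        train_dataset=dataset,
        args=SFTConfig(output_dir="elicited-model", num_train_epochs=3),
    )
    trainer.train()

if __name__ == "__main__":
    fine_tune(collect_pairs(ADJACENT_PROMPTS))
```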