Instruction-tuned LMs such as ChatGPT, FLAN, and InstructGPT are finetuned on datasets that contain user-submitted examples; for instance, FLAN aggregates numerous open-source datasets, and OpenAI leverages examples submitted in the browser playground. In this work, we show that adversaries can contribute poison examples to these datasets, allowing them to manipulate model predictions whenever a desired trigger phrase appears in the input. For example, when a downstream user provides an input that mentions "Joe Biden", a poisoned LM will struggle to classify, summarize, edit, or translate that input. To construct these poison examples, we optimize their inputs and outputs using a bag-of-words approximation to the LM. We evaluate our method on open-source instruction-tuned LMs. Using as few as 100 poison examples, we can cause arbitrary phrases to have consistent negative polarity or induce degenerate outputs across hundreds of held-out tasks. Worryingly, we also show that larger LMs are increasingly vulnerable to poisoning and that defenses based on data filtering or reducing model capacity provide only moderate protection while reducing test accuracy.