Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models

Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without sharing their data directly. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning (FedIT) yields an LLM that behaves in a helpful and safe manner. In this paper, we reveal for the first time the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, malicious clients can automatically generate attack data without manual effort and attack the FedIT system by training their local LLMs on that data. Unfortunately, the proposed safety attack not only compromises the safety alignment of an LLM trained via FedIT, but also cannot be effectively defended against by many existing FL defense methods. To address this, we further propose a post-hoc defense method that relies on a fully automated pipeline: generating defense data and then further fine-tuning the LLM on it. Extensive experiments show that our safety attack method significantly compromises the LLM's safety alignment (e.g., reducing the safety rate by 70%) and cannot be effectively countered by existing defense methods (at most 4% absolute improvement), while our safety defense method significantly enhances the attacked LLM's safety alignment (at most 69% absolute improvement).
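The abstract reports results as a safety rate and as absolute improvements. The following is a minimal sketch of how such figures could be computed, assuming the safety rate is the percentage of model responses judged safe and that "absolute improvement" means a difference in percentage points. The function names and the judgement lists are hypothetical, and the numbers are made up for illustration; they are not the paper's results.

```python
from typing import Sequence


def safety_rate(judgements: Sequence[bool]) -> float:
    """Return the percentage of responses judged safe (assumed definition)."""
    if not judgements:
        raise ValueError("need at least one judgement")
    return 100.0 * sum(judgements) / len(judgements)


def absolute_change(before: float, after: float) -> float:
    """Return the change in safety rate in percentage points."""
    return after - before


# Made-up per-response safety judgements (True = judged safe).
clean = [True] * 9 + [False] * 1      # model trained without an attacker
attacked = [True] * 2 + [False] * 8   # model after the safety attack
defended = [True] * 8 + [False] * 2   # attacked model after post-hoc defense

drop = absolute_change(safety_rate(clean), safety_rate(attacked))
gain = absolute_change(safety_rate(attacked), safety_rate(defended))
print(f"change under attack: {drop:+.1f} points")
print(f"change after defense: {gain:+.1f} points")
```

Reporting differences in percentage points rather than relative change keeps the attack and defense figures directly comparable, since both are measured on the same 0 to 100 scale.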

