Federated learning (FL) enables multiple parties to collaboratively fine-tune a large language model (LLM) without sharing data directly. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning (FedIT) yields an LLM that behaves in a helpful and safe manner. In this paper, we reveal for the first time the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, malicious clients can automatically generate attack data without manual effort and attack the FedIT system by training their local LLMs on that data. Unfortunately, this safety attack not only compromises the safety alignment of an LLM trained via FedIT, but also cannot be effectively defended against by many existing FL defense methods. To address this, we further propose a post-hoc defense method that relies on a fully automated pipeline: generating defense data and then further fine-tuning the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM's safety alignment (e.g., reducing the safety rate by 70\%) and cannot be effectively defended against by existing defense methods (at most 4\% absolute improvement), while our safety defense method can significantly enhance the attacked LLM's safety alignment (up to 69\% absolute improvement).
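To make the attack surface concrete, the following is a minimal toy sketch of how updates from a few malicious clients can shift an aggregated model. It is an illustration under stated assumptions, not the method evaluated in this paper: the abstract does not name the aggregation rule, so a sample-weighted average in the style of FedAvg is assumed, and the function `fedavg`, the two-dimensional parameter vectors, and the 7-benign/3-malicious split are all hypothetical.

```python
from typing import List, Sequence


def fedavg(client_params: Sequence[Sequence[float]],
           num_samples: Sequence[int]) -> List[float]:
    """Sample-weighted average of client parameter vectors (FedAvg-style).

    Assumption: the server aggregates by weighted averaging; the abstract
    does not specify the aggregation rule actually used.
    """
    if not client_params or len(client_params) != len(num_samples):
        raise ValueError("need one sample count per client, and at least one client")
    total = sum(num_samples)
    dim = len(client_params[0])
    return [
        sum(params[i] * n for params, n in zip(client_params, num_samples)) / total
        for i in range(dim)
    ]


# Hypothetical toy setup: coordinate 0 stands in for a "safety-aligned"
# direction; benign clients push it to +1.0, malicious clients to -1.0.
benign = [[1.0, 0.5] for _ in range(7)]
malicious = [[-1.0, 0.5] for _ in range(3)]

agg_clean = fedavg(benign, [100] * 7)
agg_attacked = fedavg(benign + malicious, [100] * 10)

print("benign only:   ", agg_clean)
print("with attackers:", agg_attacked)
```

In this toy setting, the three malicious clients pull the first coordinate of the aggregate away from the benign value while leaving the second coordinate unchanged, which loosely mirrors why an attack that alters only safety-related behavior can be hard to spot from the aggregate alone.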