Federated learning (FL) addresses privacy and data-silo issues in the training of large language models (LLMs). Most prior work focuses on improving the efficiency of federated learning for LLMs (FedLLM). However, security in open federated environments, particularly defenses against malicious clients, remains underexplored. To investigate the security of FedLLM, we conduct a preliminary study to analyze potential attack surfaces and defensive characteristics from the perspective of LoRA updates. We find two key properties of FedLLM: 1) LLMs are vulnerable to attacks from malicious clients in FL, and 2) LoRA updates exhibit distinct behavioral patterns that can be effectively distinguished by lightweight classifiers. Based on these properties, we propose Safe-FedLLM, a probe-based defense framework for FedLLM, which constructs defenses across three levels: Step-Level, Client-Level, and Shadow-Level. The core concept of Safe-FedLLM is to perform probe-based discrimination on each client's local LoRA updates, treating them as high-dimensional behavioral features and using a lightweight classifier to determine whether they are malicious. Extensive experiments demonstrate that Safe-FedLLM effectively improves FedLLM's robustness against malicious clients while maintaining competitive performance on benign data. Notably, our method effectively suppresses the impact of malicious data without significantly affecting training speed, and remains effective even under high malicious client ratios.
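The core idea of probe-based discrimination — flattening a client's LoRA update into a high-dimensional feature vector and classifying it with a lightweight model — can be illustrated with a minimal sketch. The data here is synthetic and the nearest-centroid probe is a stand-in assumption, not the paper's actual classifier or training setup:

```python
import numpy as np

def flatten_lora_update(A, B):
    """Treat a client's LoRA factors (A, B) as one behavioral feature vector."""
    return np.concatenate([A.ravel(), B.ravel()])

rng = np.random.default_rng(0)
rank, d = 4, 32  # hypothetical LoRA rank and layer width

def make_update(shift):
    # Synthetic stand-in: benign updates are small-magnitude noise;
    # malicious (e.g. poisoned) updates follow a shifted distribution.
    A = 0.01 * rng.standard_normal((rank, d)) + shift
    B = 0.01 * rng.standard_normal((d, rank)) + shift
    return flatten_lora_update(A, B)

benign = np.stack([make_update(0.0) for _ in range(40)])
malicious = np.stack([make_update(0.05) for _ in range(40)])

# A minimal "lightweight classifier": nearest-centroid over update features.
c_benign, c_mal = benign.mean(axis=0), malicious.mean(axis=0)

def is_malicious(update):
    return np.linalg.norm(update - c_mal) < np.linalg.norm(update - c_benign)

# Evaluate on fresh updates drawn from both populations.
acc = np.mean([not is_malicious(make_update(0.0)) for _ in range(20)] +
              [is_malicious(make_update(0.05)) for _ in range(20)])
```

Any separable probe (logistic regression, a small MLP) could play the same role; the point is only that LoRA updates are treated as feature vectors and filtered before aggregation.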