Private data, which is often larger in scale and higher in quality than public data, can greatly improve large language models (LLMs). However, due to privacy concerns, such data is typically dispersed across multiple silos, making its secure use for LLM training a challenge. Federated learning (FL) is a natural solution for training models on distributed private data, but traditional frameworks such as FedAvg are unsuitable for LLMs because of their high computational demands on clients. An alternative, split learning, offloads most of the training parameters to the server while training only the embedding and output layers locally, making it better suited to LLMs. Nonetheless, it faces significant challenges in security and efficiency. First, the gradients of the embeddings are vulnerable to attacks, potentially allowing private data to be reverse-engineered. Moreover, because the server can handle only one client's training request at a time, parallel training is blocked, severely limiting training efficiency. In this paper, we propose FL-GLM, a federated learning framework for LLMs that prevents data leakage from both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input block and output block on the local client to defend against embedding-gradient attacks from the server. Second, we employ key encryption for client-server communication to prevent reverse-engineering attacks by peer clients. Finally, we adopt optimization strategies such as client batching or server hierarchy, selecting the acceleration method according to the server's actual computational capacity. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves metrics comparable to those of the centralized ChatGLM model, validating the effectiveness of our federated learning framework.
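To make the split-learning placement concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of the partitioning described above: the client keeps the input block (embedding table) and the output block (LM head) locally, while the server hosts the bulk of the model body, so only intermediate hidden states cross the client-server boundary. All names, dimensions, and the single-layer "body" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, HIDDEN = 100, 16  # toy sizes, purely illustrative

# Client-side "input block": the embedding table stays on the local client.
client_embed = rng.normal(size=(VOCAB, HIDDEN))
# Client-side "output block": the LM head also stays on the local client.
client_head = rng.normal(size=(HIDDEN, VOCAB))
# Server-side body: the bulk of the parameters (reduced here to one toy layer).
server_body = rng.normal(size=(HIDDEN, HIDDEN))

def client_encode(token_ids: np.ndarray) -> np.ndarray:
    # Only hidden states, never raw tokens, leave the client.
    return client_embed[token_ids]

def server_forward(hidden: np.ndarray) -> np.ndarray:
    # The server transforms hidden states without access to the vocabulary
    # mapping held by the client's input/output blocks.
    return np.tanh(hidden @ server_body)

def client_decode(hidden: np.ndarray) -> np.ndarray:
    # The client projects the returned states back to vocabulary logits.
    return hidden @ client_head

tokens = np.array([3, 14, 15])
logits = client_decode(server_forward(client_encode(tokens)))
print(logits.shape)  # one row of logits per input token
```

In a real deployment the two `client_*` functions and `server_forward` would run in separate processes, with the hidden-state tensors (optionally encrypted, as the framework proposes) sent over the network between them.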