Private data, which is typically larger in scale and higher in quality than public data, can greatly improve large language models (LLMs). However, due to privacy concerns, such data is often dispersed across multiple silos, making its secure use for LLM training a challenge. Federated learning (FL) is a natural solution for training models on distributed private data, but traditional frameworks such as FedAvg are unsuitable for LLMs because of their high computational demands on clients. An alternative, split learning, offloads most of the trainable parameters to the server while training the embedding and output layers locally, making it better suited to LLMs. Nonetheless, it faces significant challenges in security and efficiency. First, embedding gradients are vulnerable to attacks that can reverse-engineer private data. Moreover, the server can handle only one client's training request at a time, which prevents parallel training and severely limits training efficiency. In this paper, we propose FL-GLM, a federated learning framework for LLMs that prevents data leakage from both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input and output blocks on the local client to prevent embedding-gradient attacks from the server. Second, we employ key encryption for client-server communication to prevent reverse-engineering attacks by peer clients. Finally, we adopt optimization strategies such as client-batching or server-hierarchical acceleration, selecting the method according to the server's actual computational capacity. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves metrics comparable to the centralized ChatGLM model, validating the effectiveness of our federated learning framework.