Federated Domain-Specific Knowledge Transfer on Large Language Models Using Synthetic Data

As large language models (LLMs) demonstrate unparalleled performance and generalization ability, they are widely used and integrated into various applications. In sensitive domains, such as those commonly considered in federated learning scenarios, stringent data security and privacy regulations prohibit directly using external LLMs on private data. For local clients, using LLMs to improve domain-specific small language models (SLMs), which are constrained by limited computational resources and domain-specific data, has attracted considerable research attention. Observing that LLMs can empower domain-specific SLMs, existing methods predominantly transfer knowledge from LLMs to SLMs by leveraging public data or by using LLMs to generate additional data. However, because of the discrepancies between LLM-generated data and clients' domain-specific data, these methods cannot yield substantial improvements on domain-specific tasks. In this paper, we introduce a Federated Domain-specific Knowledge Transfer (FDKT) framework, which enables domain-specific knowledge transfer from LLMs to SLMs while preserving clients' data privacy. The core insight is to leverage LLMs to augment data based on domain-specific few-shot demonstrations, which are synthesized from private domain data using differential privacy. Such synthetic samples share a similar data distribution with clients' private data and allow the server LLM to generate domain-specific knowledge that improves clients' SLMs. Extensive experimental results demonstrate that the proposed FDKT framework consistently improves SLMs' task performance by around 5\% over local training on private data, with a privacy budget of less than 10.
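
The data flow described above can be illustrated with a minimal sketch. It is not the FDKT implementation: the differentially private synthesis step is replaced here by binary randomized response on labels combined with public text templates, and the server LLM is replaced by a trivial placeholder. All names (`Example`, `TEMPLATES`, `dp_synthesize`, `server_augment`, `build_training_set`) are hypothetical and chosen only for illustration.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    text: str
    label: int  # binary label, 0 or 1


# Public, non-private templates (hypothetical) used to render synthetic text.
TEMPLATES = {
    0: "a generic negative domain sentence",
    1: "a generic positive domain sentence",
}


def dp_synthesize(private, epsilon, rng):
    """Stand-in for DP synthesis: binary randomized response on the label.

    Each label is kept with probability e^eps / (1 + e^eps) and flipped
    otherwise, which satisfies eps-local differential privacy. The text is
    rendered from a public template, so only the noisy label depends on
    the private record.
    """
    keep_prob = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    synthetic = []
    for ex in private:
        label = ex.label if rng.random() < keep_prob else 1 - ex.label
        synthetic.append(Example(TEMPLATES[label], label))
    return synthetic


def server_augment(demonstrations, per_demo):
    """Placeholder for the server LLM: expands each demonstration into variants."""
    return [
        Example(f"{demo.text} (variant {i})", demo.label)
        for demo in demonstrations
        for i in range(per_demo)
    ]


def build_training_set(private, epsilon, per_demo, seed=0):
    """Assemble SLM training data from private and server-augmented examples."""
    rng = random.Random(seed)
    demos = dp_synthesize(private, epsilon, rng)  # runs on the client
    augmented = server_augment(demos, per_demo)   # runs on the server
    return list(private) + augmented              # stays on the client
```

In this sketch only the output of `dp_synthesize` is passed to `server_augment`, so the server never receives the private text; the client then trains its SLM on the private examples together with the augmented ones.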
