Pre-training on public data is an effective method to improve the performance of federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for the on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and to generate new data that resembles the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy over the baseline model pre-trained on a standard public dataset, when evaluated on real user data from Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.
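As a minimal illustrative sketch (not from the paper), the filter-and-transform idea can be pictured as prompting an LLM to score how much each public sentence resembles mobile-keyboard text, keeping the ones that already fit and rewriting the rest. The prompt wording, the `call_llm` helper, and the score threshold below are all hypothetical placeholders for whatever LLM and prompts the authors actually used.

```python
# Sketch: LLM-based filtering and transformation of public text toward a
# mobile-typing style. All prompts and helpers here are illustrative assumptions.
from typing import Callable, List

FILTER_PROMPT = (
    "Rate from 1 to 5 how much the following sentence resembles text a person "
    "would type on a mobile keyboard (short, informal, conversational). "
    "Answer with a single number.\n\nSentence: {sentence}"
)

TRANSFORM_PROMPT = (
    "Rewrite the following sentence as a short, informal message someone might "
    "type on a phone keyboard, preserving its meaning.\n\nSentence: {sentence}"
)


def curate_public_data(
    sentences: List[str],
    call_llm: Callable[[str], str],
    min_score: int = 4,
) -> List[str]:
    """Keep sentences the LLM rates as keyboard-like; rewrite the rest."""
    curated = []
    for s in sentences:
        reply = call_llm(FILTER_PROMPT.format(sentence=s)).strip()
        score = int(reply) if reply.isdigit() else 1
        if score >= min_score:
            curated.append(s)  # already close to the target distribution
        else:
            curated.append(call_llm(TRANSFORM_PROMPT.format(sentence=s)))
    return curated


if __name__ == "__main__":
    # Stub LLM for demonstration; a real setup would call an actual LLM API.
    demo = curate_public_data(
        ["The quarterly report was filed with the commission."],
        call_llm=lambda p: "2" if "Rate" in p else "they filed the quarterly report btw",
    )
    print(demo)
```

The same prompting pattern extends to generating entirely new synthetic sentences, which the paper combines with the filtered and transformed public data for pre-training.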