Pre-training on public data is an effective method to improve the performance of federated learning (FL) with differential privacy (DP). This paper investigates how large language models (LLMs) trained on public data can improve the quality of pre-training data for the on-device language models trained with DP and FL. We carefully design LLM prompts to filter and transform existing public data, and to generate new data that resembles the real user data distribution. The model pre-trained on our synthetic dataset achieves relative improvements of 19.0% and 22.8% in next word prediction accuracy over the baseline model pre-trained on a standard public dataset, when evaluated on real user data from Gboard (Google Keyboard, a production mobile keyboard application). Furthermore, our method achieves evaluation accuracy better than or comparable to the baseline during DP FL fine-tuning over millions of mobile devices, and our final model outperforms the baseline in production A/B testing. Our experiments demonstrate the strengths of LLMs in synthesizing data close to the private distribution even without accessing the private data, and also suggest future research directions to further reduce the distribution gap.
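As a minimal illustrative sketch (not from the paper), the filter-and-transform idea can be pictured as prompting an LLM to score how much each public sentence resembles mobile-keyboard text, keeping the ones that already fit and rewriting the rest. The prompt wording, the `call_llm` helper, and the score threshold below are all hypothetical placeholders for whatever LLM and prompts the authors actually used.

```python
# Sketch: LLM-based filtering and transformation of public text toward a
# mobile-typing style. All prompts and helpers here are illustrative assumptions.
from typing import Callable, List

FILTER_PROMPT = (
    "Rate from 1 to 5 how much the following sentence resembles text a person "
    "would type on a mobile keyboard (short, informal, conversational). "
    "Answer with a single number.\n\nSentence: {sentence}"
)

TRANSFORM_PROMPT = (
    "Rewrite the following sentence as a short, informal message someone might "
    "type on a phone keyboard, preserving its meaning.\n\nSentence: {sentence}"
)


def curate_public_data(
    sentences: List[str],
    call_llm: Callable[[str], str],
    min_score: int = 4,
) -> List[str]:
    """Keep sentences the LLM rates as keyboard-like; rewrite the rest."""
    curated = []
    for s in sentences:
        reply = call_llm(FILTER_PROMPT.format(sentence=s)).strip()
        score = int(reply) if reply.isdigit() else 1
        if score >= min_score:
            curated.append(s)  # already close to the target distribution
        else:
            curated.append(call_llm(TRANSFORM_PROMPT.format(sentence=s)))
    return curated


if __name__ == "__main__":
    # Stub LLM for demonstration; a real setup would call an actual LLM API.
    demo = curate_public_data(
        ["The quarterly report was filed with the commission."],
        call_llm=lambda p: "2" if "Rate" in p else "they filed the quarterly report btw",
    )
    print(demo)
```

The same prompting pattern extends to generating entirely new synthetic sentences, which the paper combines with the filtered and transformed public data for pre-training.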