On-device training is currently the most common approach for training machine learning (ML) models on private, distributed user data. Despite this, it has several drawbacks: (1) most user devices are too small to train large models, (2) it is communication- and computation-intensive, and (3) it can be difficult to debug and deploy. To address these problems, we propose Private Evolution-Text (PrE-Text), a method for generating differentially private (DP) synthetic textual data. First, we show that across multiple datasets, small models (models that fit on user devices) trained on PrE-Text synthetic data outperform small models trained on-device under practical privacy regimes ($\epsilon=1.29$, $\epsilon=7.58$). We achieve these results while using 9$\times$ fewer rounds, 6$\times$ less client computation per round, and 100$\times$ less communication per round. Second, finetuning large models on PrE-Text's DP synthetic data improves large language model (LLM) performance on private data across the same range of privacy budgets. Altogether, these results suggest that training on DP synthetic data can be a better option than training a model on-device on private distributed data. Code is available at https://github.com/houcharlie/PrE-Text.