Privacy concerns have attracted increasing attention in data-driven products because machine learning models tend to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), provides a promising path to mitigating these concerns, but previous approaches in this direction have typically failed to produce synthetic data of high quality. In this work, we show that a simple and practical recipe is effective in the text domain: fine-tuning a pretrained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in utility with its non-private counterpart, while also providing strong protection against potential privacy leakage.
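To make "fine-tuning with DP" concrete, the following is a minimal sketch of one update step of DP-SGD, a common way to train with a DP guarantee. The abstract does not state which DP training algorithm is used, so DP-SGD is an assumption here, and the function names and parameters (`clip_gradient`, `dp_sgd_step`, `clip_norm`, `noise_multiplier`) are hypothetical, chosen only for illustration. The sketch operates on plain lists of floats rather than a real language model, and it does not perform privacy accounting (tracking the resulting epsilon), which a real setup would need.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a single example's gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Gradients already within the bound are left unchanged (scale == 1.0).
    scale = min(1.0, clip_norm / max(norm, 1e-12))
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One illustrative DP-SGD update (assumed algorithm, not confirmed by the source).

    1. Clip each example's gradient to L2 norm <= clip_norm.
    2. Sum the clipped gradients.
    3. Add Gaussian noise with std = noise_multiplier * clip_norm to the sum.
    4. Average over the batch and take a gradient step.
    """
    batch_size = len(per_example_grads)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [p - lr * n / batch_size for p, n in zip(params, noisy)]


if __name__ == "__main__":
    rng = random.Random(0)
    params = [0.0, 0.0]
    grads = [[3.0, 4.0], [0.1, -0.2]]  # toy per-example gradients
    print(dp_sgd_step(params, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.0, rng=rng))
```

Clipping bounds how much any single training example can influence the update, and the noise is calibrated to that bound; together these are what allow a formal DP guarantee to be stated for the fine-tuned model, and hence for the synthetic text sampled from it.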