Privacy concerns have attracted increasing attention in data-driven products because machine learning models tend to memorize sensitive training data. Generating synthetic versions of such data with a formal privacy guarantee, such as differential privacy (DP), is a promising way to mitigate these concerns, but previous approaches in this direction have typically failed to produce high-quality synthetic data. In this work, we show that a simple and practical recipe is effective in the text domain: fine-tuning a pretrained generative language model with DP enables the model to generate useful synthetic text with strong privacy protection. Through extensive empirical analyses on both benchmark and private customer data, we demonstrate that our method produces synthetic text that is competitive in utility with its non-private counterpart while providing strong protection against potential privacy leakage.
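The abstract does not say which training algorithm implements "fine-tuning with DP". A common choice is DP-SGD, which clips each example's gradient to a fixed L2 norm and adds Gaussian noise to the clipped sum before the update. The following is a minimal sketch of one such step, assuming DP-SGD is the mechanism; the function names (`clip_gradient`, `dp_sgd_step`) and the parameters `max_norm` and `noise_multiplier` are illustrative and not taken from the paper. Plain Python lists stand in for model parameters, and privacy accounting (tracking the resulting privacy budget) is omitted.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Rescale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One DP-SGD update (illustrative sketch, not the paper's implementation).

    Steps: clip each per-example gradient, sum the clipped gradients,
    add Gaussian noise with std noise_multiplier * max_norm to each
    coordinate of the sum, average over the batch, take a gradient step.
    """
    batch_size = len(per_example_grads)
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    # Coordinate-wise sum over the batch.
    summed = [sum(column) for column in zip(*clipped)]
    sigma = noise_multiplier * max_norm
    noisy_mean = [(s + rng.gauss(0.0, sigma)) / batch_size for s in summed]
    return [p - lr * g for p, g in zip(params, noisy_mean)]


# Toy usage: two "examples" with 2-dimensional gradients.
rng = random.Random(0)
params = [0.0, 0.0]
per_example_grads = [[3.0, 4.0], [0.3, 0.4]]
new_params = dp_sgd_step(
    params, per_example_grads,
    lr=0.1, max_norm=1.0, noise_multiplier=1.0, rng=rng,
)
```

Clipping bounds how much any single training example can move the parameters, and the noise scale is tied to that bound, which is what makes a formal DP guarantee possible. In the setting the abstract describes, the same step would be applied to the parameters of a pretrained language model, and synthetic text would then be sampled from the fine-tuned model.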