Generating tabular data under differential privacy (DP) protection ensures theoretical privacy guarantees but poses challenges for training machine learning models, primarily due to the need to capture complex structures under noisy supervision signals. Recently, pre-trained Large Language Models (LLMs) -- even those at the scale of GPT-2 -- have demonstrated great potential in synthesizing tabular data. However, their applications under DP constraints remain largely unexplored. In this work, we address this gap by applying DP techniques to the generation of synthetic tabular data. Our findings show that LLMs struggle to generate coherent text when fine-tuned with DP, as privacy budgets are inefficiently spent on non-private elements such as table structures. To overcome this, we propose \ours, a two-stage fine-tuning framework for differentially private tabular data generation: a first stage of non-private fine-tuning on a pseudo dataset, followed by DP fine-tuning on the private dataset. Our empirical results show that this approach improves performance across various settings and metrics compared to directly fine-tuning LLMs under DP. We release our code and setup at https://github.com/tejuafonja/DP-2Stage.
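The two-stage idea above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (\ours fine-tunes GPT-2 on serialized tables); it only shows the training schedule on a toy linear model, assuming DP-SGD-style per-example gradient clipping with Gaussian noise for the private stage. All names, data, and hyperparameters here are illustrative, and privacy accounting (the epsilon budget) is omitted.

```python
import numpy as np

def sgd_step(w, X, y, lr=0.1):
    # Stage 1: ordinary (non-private) gradient step on pseudo data,
    # using the mean gradient of the squared loss.
    grad = X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def dp_sgd_step(w, X, y, rng, lr=0.1, clip=1.0, noise_mult=1.0):
    # Stage 2: DP-SGD-style step -- clip each per-example gradient
    # to norm `clip`, sum, then add Gaussian noise before averaging.
    per_ex = (X @ w - y)[:, None] * X                    # per-example gradients
    norms = np.linalg.norm(per_ex, axis=1, keepdims=True)
    clipped = per_ex / np.maximum(1.0, norms / clip)     # clip to max norm
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    grad = (clipped.sum(axis=0) + noise) / len(y)
    return w - lr * grad

rng = np.random.default_rng(42)
w_true = np.array([1.0, -2.0])
# Pseudo (public-like) data for warm-up, private data for DP fine-tuning.
X_pseudo = rng.normal(size=(200, 2)); y_pseudo = X_pseudo @ w_true
X_priv = rng.normal(size=(200, 2));  y_priv = X_priv @ w_true

w = np.zeros(2)
for _ in range(50):   # stage 1: non-private warm-up on the pseudo dataset
    w = sgd_step(w, X_pseudo, y_pseudo)
for _ in range(50):   # stage 2: DP fine-tuning on the private dataset
    w = dp_sgd_step(w, X_priv, y_priv, rng)
```

The warm-up stage lets the model absorb structural regularities (here, the rough direction of `w_true`; in the paper, table formatting) without spending any privacy budget, so the noisy private stage only needs to refine what is data-specific.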