The growing use of machine learning (ML) has raised concerns that an ML model may reveal private information about individuals who contributed to its training dataset. To prevent leakage of sensitive data, we consider using differentially private (DP) synthetic training data in place of real training data to train an ML model. A key desirable property of synthetic data is its ability to preserve the low-order marginals of the original distribution. Our main contribution comprises novel upper and lower bounds on the excess empirical risk of linear models trained on such synthetic data, for continuous and Lipschitz loss functions. We complement our theoretical results with extensive experiments.