When machine learning models are trained on synthetic data and then deployed on real data, performance often drops because of the distribution shift between the two. In this paper, we introduce a new ensemble strategy for training downstream models, with the goal of improving their performance on real data. We generate multiple synthetic datasets by applying a differential privacy (DP) mechanism several times in parallel and then ensemble the downstream models trained on these datasets. Although each synthetic dataset may individually deviate more from the real data distribution, together they increase sample diversity, which may make downstream models more robust to distribution shift. Our extensive experiments show that ensembling does not improve downstream performance over training a single model when the synthetic data are generated by marginal-based or workload-based DP mechanisms. For GAN-based DP mechanisms, however, the proposed ensemble strategy improves both the accuracy and the calibration of downstream models.
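The ensemble strategy described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `toy_synthesizer` is a hypothetical stand-in that resamples rows and adds Gaussian noise (it is not differentially private), the downstream model is a plain logistic regression, and the ensemble rule is a simple average of predicted probabilities. In practice the stand-in would be replaced by an actual DP synthesizer.

```python
import numpy as np

def toy_synthesizer(X, y, rng, noise=0.5):
    """Hypothetical stand-in for a DP synthesizer (assumption, NOT private):
    bootstrap the rows and perturb the features with Gaussian noise."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx] + rng.normal(0.0, noise, size=X.shape), y[idx]

def fit_logreg(X, y, lr=0.1, steps=500):
    """Logistic regression fitted by batch gradient descent; returns (w, b)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(X)
        b -= lr * grad.mean()
    return w, b

def predict_proba(model, X):
    """Predicted probability of the positive class for one model."""
    w, b = model
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def ensemble_predict_proba(models, X):
    """Ensemble by averaging the member models' predicted probabilities."""
    return np.mean([predict_proba(m, X) for m in models], axis=0)

# Toy "real" data: a linearly separable binary task (illustrative only).
rng = np.random.default_rng(0)
X_real = rng.normal(size=(400, 3))
y_real = (X_real[:, 0] + 0.5 * X_real[:, 1] > 0).astype(float)

# Generate K synthetic datasets independently, train one model per dataset.
K = 5
models = [fit_logreg(*toy_synthesizer(X_real, y_real, rng)) for _ in range(K)]

# Evaluate the ensemble on the real data.
proba = ensemble_predict_proba(models, X_real)
accuracy = float(np.mean((proba > 0.5) == (y_real == 1)))
print(f"ensemble accuracy on real data: {accuracy:.3f}")
```

Averaging probabilities rather than hard votes is one possible choice; it keeps a continuous confidence score, which is what a calibration metric would be computed from.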