How capable are diffusion models at generating synthetic text? Recent research demonstrates their strengths, with performance approaching that of auto-regressive LLMs. But are they equally good at generating synthetic data when trained under differential privacy? Here the evidence is missing, although results from private image generation look promising. In this paper, we address this open question through extensive experiments. At the same time, we critically assess (and re-implement) previous work on synthetic private text generation with LLMs and reveal some unmet assumptions that might have led to violations of the differential privacy guarantees. Our results partly contradict previous non-private findings and show that fully open-source LLMs outperform diffusion models in the privacy regime. Our complete source code, datasets, and experimental setup are publicly available to foster future research.