Privacy and data sharing are often in tension. Many organizations release synthetic data to reduce privacy risk while still sharing useful data. For tabular data, however, auditing privacy remains hard: in many cases, even humans cannot easily tell whether a table is real or synthetic. In this paper, we propose an audit method based on LLM discrimination, in which an LLM classifies each table sample as REAL or SYNTHETIC. We test two settings: C1, with the table only, and C2, with the table plus distributional metadata. We use LLaMA as an open model and Gemini as a reference model. In our experiments, we run three synthesis models (CTGAN, TVAE, and Gaussian Copula) on two public datasets (UCI Adult and ACS Census) and collect 451 valid trials. Our results show clear differences between models. On Adult, LLaMA reaches DRS=0% in the reported cells, while Gemini reaches DRS=100% for CTGAN and TVAE. On Census, LLaMA predicts SYNTHETIC for most samples, while Gemini stays high in C1 but drops for CTGAN and TVAE in C2. We also compare against a classifier two-sample test (C2ST) and record linkage as distributional baselines, and against a human pilot with 2 annotators and 240 trials. Our results show that LLM discrimination is a practical privacy audit signal when model choice, per-provider reporting, and data encoding are handled with care. For reproducibility, code and experiment scripts are available at https://github.com/SlokomManel/LLM-as-a-Discriminator.
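To make the two prompt settings concrete, the following is a minimal Python sketch of how a C1 prompt (table only) and a C2 prompt (table plus distributional metadata) might be assembled, and how a model reply might be mapped to a REAL or SYNTHETIC label. It is not the released implementation. The prompt wording, the CSV-style table encoding, the helper names `table_to_text`, `build_prompt`, and `parse_label`, and the rule that ambiguous replies count as invalid are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence


def table_to_text(columns: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Encode a table sample as CSV-like text (an assumed encoding)."""
    header = ",".join(columns)
    body = "\n".join(",".join(str(value) for value in row) for row in rows)
    return f"{header}\n{body}"


def build_prompt(
    columns: Sequence[str],
    rows: Sequence[Sequence[object]],
    metadata: Optional[str] = None,
) -> str:
    """Build a C1 prompt (metadata is None) or a C2 prompt (metadata given).

    The wording is hypothetical; only the REAL / SYNTHETIC label set
    comes from the abstract.
    """
    parts = [
        "You are auditing a tabular data sample.",
        "Decide whether it comes from real records or from a synthesis model.",
        "Table:",
        table_to_text(columns, rows),
    ]
    if metadata is not None:
        # C2 only: append distributional metadata after the table.
        parts += ["Distributional metadata:", metadata]
    parts.append("Answer with exactly one word: REAL or SYNTHETIC.")
    return "\n".join(parts)


def parse_label(response: str) -> Optional[str]:
    """Return 'REAL' or 'SYNTHETIC', or None if the reply is ambiguous.

    Treating None as an invalid trial is an assumption of this sketch.
    """
    found = {
        match.upper()
        for match in re.findall(r"\b(REAL|SYNTHETIC)\b", response, flags=re.IGNORECASE)
    }
    return found.pop() if len(found) == 1 else None


# Illustrative values only; they are not taken from the paper's data.
columns = ["age", "workclass", "education"]
rows = [[39, "Private", "Bachelors"], [52, "Self-emp", "HS-grad"]]

c1_prompt = build_prompt(columns, rows)
c2_prompt = build_prompt(columns, rows, metadata="age: mean and std of the reference data")
```

The resulting prompt string would be sent to whichever LLM client is in use, and `parse_label` applied to the reply. Requiring exactly one of the two labels in the reply keeps hedged or off-format answers from being silently counted as a prediction.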