SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon\lesssim1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle.
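
To make the threat model concrete, the following minimal sketch shows one very simple reconstruction (attribute inference) baseline: given a synthetic release and a target's known quasi-identifiers, guess the hidden attribute as the most common value among synthetic rows with matching quasi-identifiers. This is an illustration only; it is not CoBP-RA or necessarily any of the fourteen attacks evaluated, and the function names, column names, and toy records are all hypothetical.

```python
from collections import Counter, defaultdict


def fit_majority_attack(synthetic_rows, quasi_ids, target):
    """Index a synthetic release (a list of dicts) by quasi-identifier values.

    Hypothetical baseline for illustration; not an attack from the paper.
    """
    by_qi = defaultdict(Counter)  # quasi-identifier tuple -> target value counts
    overall = Counter()           # marginal counts of the target attribute
    for row in synthetic_rows:
        key = tuple(row[q] for q in quasi_ids)
        by_qi[key][row[target]] += 1
        overall[row[target]] += 1
    return by_qi, overall


def reconstruct(known, by_qi, overall, quasi_ids):
    """Guess the hidden attribute from the known quasi-identifiers.

    Falls back to the overall most common value when no synthetic row
    shares the target's quasi-identifiers.
    """
    key = tuple(known[q] for q in quasi_ids)
    counts = by_qi.get(key) or overall
    return counts.most_common(1)[0][0]


# Toy synthetic release (made-up values, for illustration only).
synthetic = [
    {"age": "30-39", "zip": "021", "income": "high"},
    {"age": "30-39", "zip": "021", "income": "high"},
    {"age": "30-39", "zip": "021", "income": "low"},
    {"age": "60-69", "zip": "945", "income": "low"},
    {"age": "60-69", "zip": "945", "income": "low"},
]
quasi_ids = ["age", "zip"]
by_qi, overall = fit_majority_attack(synthetic, quasi_ids, "income")

# The adversary knows only the target's quasi-identifiers.
matched_guess = reconstruct({"age": "30-39", "zip": "021"}, by_qi, overall, quasi_ids)
fallback_guess = reconstruct({"age": "40-49", "zip": "100"}, by_qi, overall, quasi_ids)
print(matched_guess, fallback_guess)
```

A correct guess from such a baseline may reflect nothing more than population-level correlation between quasi-identifiers and the hidden attribute, which is why the abstract separates distributional reconstruction from memorization of training records.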
