Differentially Private (DP) generative marginal models are often used in the wild to release synthetic tabular datasets in lieu of sensitive data while providing formal privacy guarantees. These models approximate low-dimensional marginals or query workloads; crucially, they require the training data to be pre-discretized, i.e., continuous values must first be partitioned into bins. However, the range (or domain) of values is often inferred directly from the training data, and the number of bins and bin edges are typically chosen arbitrarily; this can ultimately break end-to-end DP guarantees and may not yield optimal utility. In this paper, we present an extensive measurement study of four discretization strategies in the context of DP marginal generative models. More precisely, we design DP versions of three discretizers (uniform, quantile, and k-means) and reimplement the PrivTree algorithm. We find that optimizing both the choice of discretizer and the bin count can improve utility, on average, by almost 30% across six DP marginal models compared to the default strategy and number of bins, with PrivTree being the best-performing discretizer in the majority of cases. We demonstrate that, while DP generative models with non-private discretization remain vulnerable to membership inference attacks, applying DP during discretization effectively mitigates this risk. Finally, we improve on an existing approach for automatically selecting the optimal number of bins, achieving high utility while reducing both privacy budget consumption and computational overhead.
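To make the discretization setting concrete, the sketch below shows one way a uniform discretizer could be made DP: instead of taking the column's min and max directly from the training data (which leaks information), the range is estimated with Laplace noise after clipping to loose, data-independent bounds, and uniform bin edges are then placed over that noisy range. This is only an illustrative Python sketch under stated assumptions; the function name, parameters (`lo`, `hi`, `epsilon`, `n_bins`), and the Laplace-based range estimate are our own choices and not necessarily the mechanism evaluated in the paper.

```python
import numpy as np

def dp_uniform_discretizer(values, n_bins, epsilon, lo, hi, rng=None):
    """Hypothetical sketch: uniform binning over a DP-estimated range.

    `lo`/`hi` are loose, data-independent bounds (e.g., from domain
    knowledge); clipping to them bounds the sensitivity of min/max.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.clip(np.asarray(values, dtype=float), lo, hi)

    # Under add/remove-one-record DP, the clipped min and max can each change
    # by at most (hi - lo), so Laplace noise is scaled to that sensitivity,
    # with the budget split evenly across the two estimates.
    scale = 2 * (hi - lo) / epsilon
    noisy_min = np.clip(x.min() + rng.laplace(scale=scale), lo, hi)
    noisy_max = np.clip(x.max() + rng.laplace(scale=scale), lo, hi)
    if noisy_min > noisy_max:
        noisy_min, noisy_max = noisy_max, noisy_min

    # The noisy edges are the only data-dependent artifact produced here; the
    # binned records are then consumed by a downstream DP marginal model,
    # which provides its own guarantee on the released synthetic data.
    edges = np.linspace(noisy_min, noisy_max, n_bins + 1)
    codes = np.clip(np.digitize(x, edges[1:-1]), 0, n_bins - 1)
    return codes, edges
```

In this sketch, the budget spent on the range estimate would be accounted for alongside the budget of the downstream generative model to preserve an end-to-end DP guarantee, which is precisely the accounting that ad hoc, data-derived binning omits.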