This paper presents a novel method for generating differentially private tabular datasets from hierarchical data, with a specific focus on origin-destination (O/D) trips. The approach builds on the TopDown algorithm, a constraint-based mechanism developed by the US Census Bureau to incorporate invariant queries into tabular data. Hierarchical O/D data are datasets that represent trips between geographical areas organized in a hierarchy (e.g., region $\rightarrow$ province $\rightarrow$ city). The method is designed to improve accuracy on queries that span wider geographical areas and are obtained by aggregation; high accuracy on such aggregated queries is a crucial property of a differentially private dataset, particularly for practitioners. The approach is also designed to minimize false-positive detections and to replicate the sparsity of the sensitive data. The key technical contributions are a novel TopDown algorithm that employs constrained optimization with Chebyshev distance minimization, with theoretical guarantees based on the maximum absolute error, and a new integer optimization algorithm that significantly reduces the incidence of false positives. The approach is validated on both real-world and synthetic O/D datasets, where it generates private data with high utility and fewer false positives. We emphasize that the proposed algorithm is applicable to any tabular data with a hierarchical structure.
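To make the quantities named in the abstract concrete, the following minimal sketch computes them on a toy O/D table. It is not the paper's algorithm and applies no differential privacy mechanism: the "private" counts are made up by hand, and the city names, the `CITY_TO_PROVINCE` mapping, and all helper functions are hypothetical. It also assumes that a false positive means a cell with a positive count in the private data and a zero count in the sensitive data, which the abstract does not define explicitly. The Chebyshev distance used here is the standard $\|x - y\|_\infty = \max_i |x_i - y_i|$, i.e., the maximum absolute error.

```python
# Illustrative only: toy data and helper names are assumptions, not from the paper.

# Hypothetical two-level hierarchy: each city belongs to exactly one province.
CITY_TO_PROVINCE = {"A1": "A", "A2": "A", "B1": "B", "B2": "B"}


def aggregate(trips, mapping):
    """Sum city-level O/D counts into counts at the parent level."""
    out = {}
    for (origin, dest), count in trips.items():
        key = (mapping[origin], mapping[dest])
        out[key] = out.get(key, 0) + count
    return out


def chebyshev_distance(x, y):
    """Maximum absolute difference over all cells; absent cells count as 0."""
    keys = set(x) | set(y)
    return max((abs(x.get(k, 0) - y.get(k, 0)) for k in keys), default=0)


def false_positives(sensitive, private):
    """Cells with a positive private count but a zero sensitive count (assumed definition)."""
    return sorted(
        k for k, v in private.items() if v > 0 and sensitive.get(k, 0) == 0
    )


# Sparse city-level tables: only non-zero cells are stored.
sensitive = {("A1", "A2"): 5, ("A1", "B1"): 3, ("B2", "A1"): 2}
# Hand-written stand-in for a privatized release (no DP mechanism is run here).
private = {("A1", "A2"): 6, ("A1", "B1"): 1, ("B2", "A1"): 2, ("B1", "B2"): 1}

city_error = chebyshev_distance(sensitive, private)
city_fp = false_positives(sensitive, private)

# The same metrics on the aggregated (province-level) query.
sensitive_prov = aggregate(sensitive, CITY_TO_PROVINCE)
private_prov = aggregate(private, CITY_TO_PROVINCE)
prov_error = chebyshev_distance(sensitive_prov, private_prov)
prov_fp = false_positives(sensitive_prov, private_prov)

print("city level:", city_error, city_fp)
print("province level:", prov_error, prov_fp)
```

Storing only non-zero cells mirrors the sparsity that the abstract says the private data should replicate: every false positive adds a cell that does not exist in the sensitive table, and it can persist after aggregation to the parent level.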