The US Decennial Census provides valuable data for both research and policy purposes. Census data are subject to a variety of disclosure avoidance techniques prior to release in order to preserve respondent confidentiality. While many researchers are interested in studying the impacts of disclosure avoidance methods on downstream analyses, particularly with the introduction of differential privacy in the 2020 Decennial Census, these efforts are limited by a critical lack of data: the underlying "microdata," which serve as the necessary input to disclosure avoidance methods, are kept confidential. In this work, we address this limitation by providing tools to generate synthetic microdata solely from published Census statistics, which can then be used as input to any number of disclosure avoidance algorithms for evaluation and comparison. We define a principled distribution over microdata given published Census statistics and design algorithms to sample from this distribution. We formulate synthetic data generation in this context as a knapsack-style combinatorial optimization problem and develop novel algorithms for this setting. While the problem we study is provably hard, we show empirically that our methods work well in practice, and we offer theoretical arguments to explain this performance. Finally, we verify that the data we produce are "close" to the desired ground truth.
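To make the knapsack-style formulation concrete, here is a minimal toy sketch (not the paper's actual algorithm): given hypothetical household "types," each contributing a vector of counts, we search for combinations of type counts whose summed contributions exactly reproduce a published statistics vector. The household types, the published vector, and the brute-force search are all illustrative assumptions; the real problem operates over far richer record types and statistics.

```python
from itertools import product

# Hypothetical household types: each contributes (people, adults, children).
HOUSEHOLD_TYPES = {
    "single_adult":   (1, 1, 0),
    "couple":         (2, 2, 0),
    "family_of_four": (4, 2, 2),
}

# Assumed published (population, adults, children) totals for one block.
PUBLISHED = (7, 5, 2)

def consistent_microdata(types, published, max_count=5):
    """Brute-force search over household-type counts (knapsack-style):
    return every count assignment whose summed contributions equal the
    published statistics exactly."""
    names = list(types)
    solutions = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        totals = tuple(
            sum(c * types[n][i] for c, n in zip(counts, names))
            for i in range(len(published))
        )
        if totals == published:
            solutions.append(dict(zip(names, counts)))
    return solutions

print(consistent_microdata(HOUSEHOLD_TYPES, PUBLISHED))
```

Even this toy instance has multiple microdata sets consistent with the published totals, which is why the paper defines a distribution over consistent microdata and samples from it rather than committing to a single reconstruction; the exhaustive search here also illustrates why the exact problem is computationally hard at realistic scales.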