Federal administrative data, such as tax data, are invaluable for research, but because of privacy concerns, access to these data is typically limited to select agencies and a few individuals. An alternative to sharing microlevel data is to allow individuals to query statistics without directly accessing the confidential data. This paper studies the feasibility of using differentially private (DP) methods to answer certain queries while preserving privacy. We also introduce methodological adaptations of existing DP regression methods so that they can handle new data types and return standard error estimates. We define feasibility in terms of the impact of DP methods on analyses that inform public policy decisions and the accuracy of the queries according to several utility metrics. We evaluate the methods using Internal Revenue Service data and public-use Current Population Survey data and identify how specific data features might challenge some of these methods. Our findings show that DP methods are feasible for simple, univariate statistics but struggle to produce accurate regression estimates and confidence intervals. To the best of our knowledge, this is the first comprehensive statistical study of DP regression methodology on real, complex datasets, and the findings have significant implications for the direction of a growing research field and for public policy.
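To make the idea of querying a statistic without accessing the confidential records more concrete, the following is a minimal sketch of one simple, univariate DP query: a mean released through the Laplace mechanism. It is an illustration only and is not the method evaluated in the paper. The function names `laplace_noise` and `dp_mean`, the clipping bounds `lower` and `upper`, and the treatment of the sample size as public are all assumptions made for this example.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_mean(values, lower, upper, epsilon, rng=None):
    """Return an epsilon-DP estimate of the mean (hypothetical helper).

    Assumptions for this sketch:
      * the number of records is treated as public,
      * neighboring datasets differ by replacing one record,
      * `lower` and `upper` are chosen without looking at the data.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if upper <= lower:
        raise ValueError("upper must exceed lower")
    rng = rng or random.Random()

    # Clip each record so one record can shift the sum by at most (upper - lower).
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)

    # Sensitivity of the clipped mean when one record is replaced.
    sensitivity = (upper - lower) / n
    true_mean = sum(clipped) / n

    # Laplace mechanism: noise scale is sensitivity / epsilon.
    return true_mean + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    incomes = [32_000, 54_000, 41_500, 120_000, 67_250]  # made-up values
    print(dp_mean(incomes, lower=0, upper=200_000, epsilon=1.0))
```

Smaller values of `epsilon` give stronger privacy but add more noise, and wide clipping bounds inflate the noise for skewed variables. A production implementation would also need to guard against floating-point attacks on the noise sampler, which this sketch does not attempt.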