Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to gauge in advance which candidate datasets would be most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging, as the value of a merge depends not only on the reduction in epistemic uncertainty but also on the improvement in overlap. To address this challenge, we introduce the first cryptographically secure, information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the Expected Information Gain (EIG) and using multi-party computation so that it can be computed securely without revealing any raw data. As we demonstrate, the approach can be combined with differential privacy (DP) to meet privacy requirements whilst preserving more accurate computation than naive DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. We demonstrate the effectiveness and reliability of our method on a range of simulated and realistic benchmarks. The code is available anonymously.
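To make the idea of scoring a merge from shared statistics concrete, the following is a minimal sketch and not the protocol of this work. It assumes a Bayesian linear model with known noise variance and an isotropic Gaussian prior as a stand-in for the treatment effect model, for which the EIG has a closed form in the Gram matrices. The names `encode`, `decode`, `share`, `eig_linear`, the modulus `Q` and the fixed-point `SCALE` are all hypothetical. The sketch also reconstructs the summed Gram matrix in the clear, which a full multi-party protocol would avoid by evaluating the log-determinant on shares.

```python
import numpy as np

Q = 2**61 - 1   # modulus of the toy sharing ring (assumption)
SCALE = 2**16   # fixed-point scale for encoding floats as integers (assumption)

def encode(a):
    """Fixed-point encode a float array as integers modulo Q."""
    return np.mod(np.round(a * SCALE).astype(np.int64), Q)

def decode(a):
    """Map integers modulo Q back to floats; values above Q/2 are negative."""
    a = np.where(a > Q // 2, a - Q, a)
    return a.astype(np.float64) / SCALE

def share(a, rng):
    """Split an encoded array into two additive shares summing to it modulo Q."""
    r = rng.integers(0, Q, size=a.shape, dtype=np.int64)
    return r, np.mod(a - r, Q)

def eig_linear(gram_host, gram_total, noise_var=1.0, prior_prec=1.0):
    """Closed-form EIG about the weights of a Bayesian linear model:
    half the log-determinant ratio of posterior precisions after and before the merge."""
    d = gram_host.shape[0]
    prec_before = prior_prec * np.eye(d) + gram_host / noise_var
    prec_after = prior_prec * np.eye(d) + gram_total / noise_var
    return 0.5 * (np.linalg.slogdet(prec_after)[1] - np.linalg.slogdet(prec_before)[1])

rng = np.random.default_rng(0)
X_host = rng.normal(size=(50, 3))   # host covariates (synthetic)
X_cand = rng.normal(size=(30, 3))   # candidate covariates (synthetic)
G_host, G_cand = X_host.T @ X_host, X_cand.T @ X_cand

# Each party secret-shares its own Gram matrix between two compute nodes.
h0, h1 = share(encode(G_host), rng)
c0, c1 = share(encode(G_cand), rng)

# Each node adds the shares it holds; a single node's view is uniformly random.
node0 = np.mod(h0 + c0, Q)
node1 = np.mod(h1 + c1, Q)

# Toy step: open the sum. A real protocol would keep this secret-shared.
G_total = decode(np.mod(node0 + node1, Q))

eig_secure = eig_linear(G_host, G_total)
eig_plain = eig_linear(G_host, G_host + G_cand)
```

Under these assumptions `eig_secure` matches `eig_plain` up to fixed-point rounding, and a larger value indicates a candidate whose covariates reduce posterior uncertainty more.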