Is merging worth it? Securely evaluating the information gain for causal dataset acquisition

Merging datasets across institutions is a lengthy and costly procedure, especially when it involves private information. Data hosts may therefore want to gauge in advance which datasets are most beneficial to merge with, without revealing sensitive information. For causal estimation this is particularly challenging, as the value of a merge depends not only on the reduction in epistemic uncertainty but also on the improvement in overlap. To address this challenge, we introduce the first cryptographically secure information-theoretic approach for quantifying the value of a merge in the context of heterogeneous treatment effect estimation. We do this by evaluating the Expected Information Gain (EIG) and using multi-party computation so that it can be computed securely without revealing any raw data. As we demonstrate, this can be combined with differential privacy (DP) to meet privacy requirements whilst preserving more accurate computation than naive DP alone. To the best of our knowledge, this work presents the first privacy-preserving method for dataset acquisition tailored to causal estimation. We demonstrate the effectiveness and reliability of our method on a range of simulated and realistic benchmarks. The code is available anonymously.
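To make the notion of EIG concrete, the following is a minimal sketch under a strong simplifying assumption: a Bayesian linear model with Gaussian prior and Gaussian noise, for which the EIG has a closed form. This is an illustration only, not the estimator used in the paper; the function names (`posterior_cov`, `linear_gaussian_eig`) and the toy data are hypothetical. In this toy model the EIG depends on a candidate dataset only through its Gram matrix `X.T @ X`, an aggregate of the kind a secure computation could work with instead of raw rows.

```python
import numpy as np

def posterior_cov(X_host, prior_cov, noise_var):
    """Posterior covariance of the weights after the host's own data (toy linear-Gaussian model)."""
    precision = np.linalg.inv(prior_cov) + X_host.T @ X_host / noise_var
    return np.linalg.inv(precision)

def linear_gaussian_eig(X_new, weight_cov, noise_var):
    """EIG about the weights from observing outcomes at covariates X_new.

    For y = X_new @ w + noise, with w ~ N(0, weight_cov) and noise ~ N(0, noise_var),
    the mutual information is 0.5 * log det(I + weight_cov @ X_new.T @ X_new / noise_var).
    """
    d = weight_cov.shape[0]
    gram = X_new.T @ X_new  # the only statistic of the candidate data that is needed
    _, logdet = np.linalg.slogdet(np.eye(d) + weight_cov @ gram / noise_var)
    return 0.5 * logdet

# Hypothetical toy data: the host has only observed variation in feature 0.
X_host = np.tile([1.0, 0.0], (10, 1))
cov_after_host = posterior_cov(X_host, np.eye(2), noise_var=1.0)

# Candidate A repeats the host's region; candidate B covers the unseen feature 1.
X_a = np.tile([1.0, 0.0], (5, 1))
X_b = np.tile([0.0, 1.0], (5, 1))
eig_a = linear_gaussian_eig(X_a, cov_after_host, noise_var=1.0)
eig_b = linear_gaussian_eig(X_b, cov_after_host, noise_var=1.0)
print(f"EIG(A) = {eig_a:.3f}, EIG(B) = {eig_b:.3f}")
```

In this sketch candidate B scores higher than candidate A because it covers a region of covariate space the host has not seen, which loosely mirrors the abstract's point that the value of a merge depends on where the new data falls and not just on how much of it there is.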

