In surveys, individuals typically decide for themselves whether to participate, which leads to participation bias: those willing to share their data might not be representative of the entire population. Similarly, there are cases where one has no direct access to any data from the target population and has to resort to publicly available proxy data sampled from a different distribution. In this paper, we present Differentially Private Propensity Scores for Bias Correction (DiPPS), a method for approximating the true data distribution of interest in both of these settings. We assume that the data analyst has access to a dataset $\tilde{D}$ that was sampled from the distribution of interest in a biased way. As individuals may be more willing to share their data when given a privacy guarantee, we further assume that the analyst is allowed locally differentially private access to a set of samples $D$ from the true, unbiased distribution. Each data point in the private, unbiased dataset $D$ is mapped to a probability distribution over clusters (learned from the biased dataset $\tilde{D}$), from which a single cluster is sampled via the exponential mechanism and shared with the data analyst. In this way, the analyst gathers a distribution over clusters and uses it to compute propensity scores for the points in $\tilde{D}$; these scores are in turn used to reweight the points in $\tilde{D}$ to approximate the true data distribution. Any function can then be computed on the reweighted dataset without further access to the private $D$. In experiments on datasets from various domains, we show that DiPPS successfully brings the distribution of the available dataset closer to the distribution of interest in terms of Wasserstein distance. We further show that this results in improved estimates for different statistics.
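The pipeline described above (cluster the biased data, let each private point report one cluster through the exponential mechanism, then reweight the biased points) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the plain k-means clustering, the softmax over negative squared distances as the per-point cluster distribution, the use of that probability as the utility (bounded in $[0, 1]$, so sensitivity at most 1), and the simple per-cluster frequency ratio used as the weight. In particular, the sketch does not correct the reported histogram for the noise added by the mechanism. All function names (`kmeans`, `cluster_probs`, `exponential_mechanism`, `dipps_sketch`) are hypothetical.

```python
import numpy as np

def sq_dists(X, centers):
    """Squared Euclidean distance from every row of X to every center."""
    return ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)

def kmeans(X, k, rng, iters=20):
    """Plain Lloyd's algorithm; a stand-in for whatever clustering is actually used."""
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        labels = sq_dists(X, centers).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def cluster_probs(X, centers):
    """Assumed soft assignment: softmax over negative squared distances."""
    logits = -sq_dists(X, centers)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

def exponential_mechanism(probs, epsilon, rng):
    """Each row reports one cluster c with probability proportional to
    exp(epsilon * u_c / 2), where the utility u_c is the cluster probability."""
    scores = np.exp(epsilon * probs / 2.0)
    scores /= scores.sum(axis=1, keepdims=True)
    cum = scores.cumsum(axis=1)
    u = rng.random((len(probs), 1))
    # Inverse-CDF sampling per row; the clamp guards against rounding in cumsum.
    return np.minimum((u > cum).sum(axis=1), probs.shape[1] - 1)

def dipps_sketch(X_biased, X_private, k, epsilon, rng):
    """Return normalized weights for X_biased (illustrative simplification)."""
    centers = kmeans(X_biased, k, rng)
    # In a real deployment, each individual would run this step locally.
    reports = exponential_mechanism(cluster_probs(X_private, centers), epsilon, rng)
    labels = sq_dists(X_biased, centers).argmin(axis=1)
    p_biased = np.bincount(labels, minlength=k) / len(labels)
    q_target = np.bincount(reports, minlength=k) / len(reports)
    w = q_target[labels] / p_biased[labels]  # per-cluster frequency ratio
    return w / w.sum()

# Toy usage: the biased sample over-represents the left mode.
rng = np.random.default_rng(0)
X_biased = np.concatenate([rng.normal(-2, 0.5, 900), rng.normal(2, 0.5, 100)])[:, None]
X_private = np.concatenate([rng.normal(-2, 0.5, 1000), rng.normal(2, 0.5, 1000)])[:, None]
weights = dipps_sketch(X_biased, X_private, k=2, epsilon=4.0, rng=rng)
print("unweighted mean:", X_biased[:, 0].mean())
print("reweighted mean:", (weights * X_biased[:, 0]).sum())
```

Once the weights are computed, any weighted statistic of the biased dataset can be evaluated without touching the private samples again, which mirrors the property stated above. Smaller values of `epsilon` push the reported histogram toward uniform, so a faithful implementation would need to account for that noise when forming the propensity scores.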