A key challenge in many modern data analysis tasks is that user data are heterogeneous. Different users may possess vastly different numbers of data points. More importantly, it cannot be assumed that all users sample from the same underlying distribution. This is true, for example, in language data, where different speech styles result in data heterogeneity. In this work, we propose a simple model of heterogeneous user data that allows users to differ in both the distribution and the quantity of their data, and we provide a method for estimating the population-level mean while preserving user-level differential privacy. We demonstrate the asymptotic optimality of our estimator and also prove general lower bounds on the error achievable in the setting we introduce.