Learning from Aggregated Data: Curated Bags versus Random Bags

Protecting user privacy is a major concern for many machine learning systems that are deployed at scale and collect data from a diverse population. One way to address this concern is to collect and release data labels in aggregated form, so that the information about any single user is combined with that of others. In this paper, we explore the possibility of training machine learning models with aggregated data labels rather than individual labels. Specifically, we consider two natural aggregation procedures suggested by practitioners: curated bags, where the data points are grouped based on common features, and random bags, where the data points are grouped randomly into bags of similar size. For the curated bag setting and for a broad range of loss functions, we show that gradient-based learning can be performed without the performance degradation that aggregation might otherwise cause. Our method is based on the observation that the sum of the gradients of the loss function over the individual examples in a curated bag can be computed from the aggregate label alone, without access to individual labels. For the random bag setting, we provide a generalization risk bound based on the Rademacher complexity of the hypothesis class and show how empirical risk minimization can be regularized to achieve the smallest risk bound. As our bound indicates, the random bag setting exhibits a trade-off between the bag size and the achievable error rate. Finally, we conduct a careful empirical study to confirm our theoretical findings. In particular, our results suggest that aggregate learning can be an effective method for preserving user privacy while maintaining model accuracy.
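
The curated-bag observation can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: every example in the bag shares the same feature vector x, the model is linear, and the loss is the squared loss 0.5 * (w·x - y)^2. Under these assumptions the per-example gradient is (w·x - y) x, which is linear in y, so the sum over the bag depends on the labels only through their sum. The function names below are hypothetical and chosen for illustration; this is not the paper's implementation.

```python
def per_example_gradient_sum(w, x, labels):
    """Sum of squared-loss gradients computed from individual labels."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    grad = [0.0] * len(w)
    for y in labels:
        # Gradient of 0.5 * (pred - y)**2 with respect to w is (pred - y) * x.
        for j, xj in enumerate(x):
            grad[j] += (pred - y) * xj
    return grad


def aggregate_gradient_sum(w, x, bag_size, label_sum):
    """Same quantity computed from the bag size and the aggregate label only."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return [(bag_size * pred - label_sum) * xj for xj in x]


# Illustrative values (assumed): one curated bag of four examples sharing x.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 3.0]
labels = [1.0, 0.0, 1.0, 1.0]

g_individual = per_example_gradient_sum(w, x, labels)
g_aggregate = aggregate_gradient_sum(w, x, len(labels), sum(labels))
```

Here `g_individual` and `g_aggregate` agree, even though the second call never sees the individual labels. The same argument applies to any loss whose gradient is linear in the label once the features are fixed; whether that covers the full range of losses treated in the paper is not stated in the abstract.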
