In many real-world applications, due to recent developments in the privacy landscape, training data may be aggregated to preserve the privacy of sensitive training labels. In the learning from label proportions (LLP) framework, the dataset is partitioned into bags of feature-vectors, and only the sum of the labels in each bag is available. A further restriction, which we call learning from bag aggregates (LBA), is that instead of the individual feature-vectors, only their (possibly weighted) sum per bag is available. We study whether such aggregation techniques can provide privacy guarantees under the notion of label differential privacy (label-DP) previously studied in, e.g., [Chaudhuri-Hsu'11, Ghazi et al.'21, Esfandiari et al.'22]. It is easily seen that naive LBA and LLP do not provide label-DP. Our main result, however, shows that weighted LBA using i.i.d. Gaussian weights with $m$ randomly sampled disjoint $k$-sized bags is in fact $(\varepsilon, \delta)$-label-DP for any $\varepsilon > 0$ with $\delta \approx \exp(-\Omega(\sqrt{k}))$, assuming a lower bound on the linear-mse regression loss. Further, the $\ell_2^2$-regressor that minimizes the loss on the aggregated dataset has a loss within a $\left(1 + o(1)\right)$-factor of the optimum on the original dataset with probability $\approx 1 - \exp(-\Omega(m))$. We emphasize that no additive label noise is required. The analogous weighted LLP, however, does not admit label-DP. Nevertheless, we show that if additive $N(0, 1)$ noise can be added to any constant fraction of the instance labels, then the noisy weighted LLP admits similar label-DP guarantees without assumptions on the dataset, while preserving the utility of Lipschitz-bounded neural mse-regression tasks. Our work is the first to demonstrate that label-DP can be achieved by randomly weighted aggregation for regression tasks, using little or no additive noise.
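The weighted LBA aggregation described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `weighted_lba`, the helper `mse`, the synthetic data, and the values of `n`, `d`, `m`, and `k` are all assumptions chosen for illustration. The sketch only shows the mechanics (disjoint random bags, i.i.d. Gaussian weights, per-bag weighted sums, then least squares on the aggregates); it does not verify any privacy guarantee.

```python
import numpy as np


def weighted_lba(X, y, m, k, rng):
    """Aggregate (X, y) into m weighted bag sums (illustrative sketch only)."""
    n, _ = X.shape
    if m * k > n:
        raise ValueError("need m * k <= n to form disjoint bags")
    # Sample m disjoint bags of k example indices each, uniformly at random.
    idx = rng.permutation(n)[: m * k].reshape(m, k)
    # One i.i.d. N(0, 1) weight per sampled example.
    w = rng.standard_normal((m, k))
    # Per-bag weighted sums of the feature-vectors and of the labels.
    X_agg = np.einsum("mk,mkd->md", w, X[idx])
    y_agg = np.einsum("mk,mk->m", w, y[idx])
    return X_agg, y_agg


def mse(X, y, theta):
    """Mean squared error of the linear predictor x -> <x, theta>."""
    return float(np.mean((X @ theta - y) ** 2))


# Synthetic data (an assumption): linear labels plus residual noise, so the
# optimal linear-mse loss on the original dataset is bounded away from zero.
rng = np.random.default_rng(0)
n, d, m, k = 20000, 5, 400, 50
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + rng.standard_normal(n)

X_agg, y_agg = weighted_lba(X, y, m, k, rng)

# Least-squares regressors fitted on the aggregated and on the original data.
theta_agg = np.linalg.lstsq(X_agg, y_agg, rcond=None)[0]
theta_opt = np.linalg.lstsq(X, y, rcond=None)[0]

# Ratio of original-dataset losses: regressor from aggregates vs. the optimum.
ratio = mse(X, y, theta_agg) / mse(X, y, theta_opt)
print(f"loss ratio on the original dataset: {ratio:.4f}")
```

The learner in this sketch only ever sees `X_agg` and `y_agg`, that is, $m$ weighted sums rather than the $n$ individual feature-vectors and labels. The printed ratio compares, on the original dataset, the loss of the regressor fitted on the aggregates with the optimal linear loss.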