In many real-world applications, particularly in light of recent developments in the privacy landscape, training data may be aggregated to preserve the privacy of sensitive training labels. In the learning from label proportions (LLP) framework, the dataset is partitioned into bags of feature-vectors, and only the sum of the labels in each bag is available. A further restriction, which we call learning from bag aggregates (LBA), is where, instead of the individual feature-vectors, only the (possibly weighted) sum of the feature-vectors per bag is available. We study whether such aggregation techniques can provide privacy guarantees under the notion of label differential privacy (label-DP), previously studied in, e.g., [Chaudhuri-Hsu'11, Ghazi et al.'21, Esfandiari et al.'22]. It is easily seen that naive LBA and LLP do not provide label-DP. Our main result, however, shows that weighted LBA using i.i.d. Gaussian weights with $m$ randomly sampled disjoint $k$-sized bags is in fact $(\varepsilon, \delta)$-label-DP for any $\varepsilon > 0$ with $\delta \approx \exp(-\Omega(\sqrt{k}))$, assuming a lower bound on the linear MSE regression loss. Further, this preserves the optimum over linear MSE regressors of bounded norm to within a $(1 \pm o(1))$-factor with probability $\approx 1 - \exp(-\Omega(m))$. We emphasize that no additive label noise is required. The analogous weighted LLP, however, does not admit label-DP. Nevertheless, we show that if additive $N(0, 1)$ noise can be added to any constant fraction of the instance labels, then the noisy weighted LLP admits similar label-DP guarantees without assumptions on the dataset, while preserving the utility of Lipschitz-bounded neural MSE regression tasks. Our work is the first to demonstrate that label-DP can be achieved by randomly weighted aggregation for regression tasks, using little or no additive noise.
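The weighted LBA aggregation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `weighted_lba`, its signature, and the choice to apply the same Gaussian weights to both the feature-vectors and the labels of a bag are assumptions made for illustration, and the sketch makes no claim about the privacy parameters achieved.

```python
import numpy as np


def weighted_lba(X, y, m, k, rng):
    """Illustrative weighted bag aggregation (hypothetical helper).

    Samples m disjoint bags of size k uniformly at random, draws i.i.d.
    N(0, 1) weights, and returns the weighted sum of feature-vectors and
    the weighted sum of labels for each bag.
    """
    n = X.shape[0]
    if m * k > n:
        raise ValueError("need m * k <= n to form disjoint bags")

    # A random permutation cut into m rows of k indices gives disjoint bags.
    idx = rng.permutation(n)[: m * k].reshape(m, k)

    # One i.i.d. standard Gaussian weight per (bag, instance) pair.
    W = rng.standard_normal((m, k))

    # Assumption: the same weights are used for features and labels.
    X_agg = np.einsum("bk,bkd->bd", W, X[idx])  # shape (m, d)
    y_agg = np.sum(W * y[idx], axis=1)          # shape (m,)
    return X_agg, y_agg, idx, W


# Small synthetic example: 1000 instances, 5 features, 20 bags of size 50.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
y = X @ np.arange(1.0, 6.0) + 0.1 * rng.standard_normal(1000)
X_agg, y_agg, idx, W = weighted_lba(X, y, m=20, k=50, rng=rng)
```

In this sketch only `X_agg` and `y_agg` would be released to the learner; the bag indices and weights are returned solely so the aggregation can be inspected.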