We introduce S-BDT: a novel $(\varepsilon,\delta)$-differentially private distributed gradient boosted decision tree (GBDT) learner that improves the protection of individual training data points (privacy) while achieving meaningful learning goals, such as accuracy or regression error (utility). S-BDT uses less noise by relying on non-spherical multivariate Gaussian noise, for which we prove tight privacy-amplification-by-subsampling bounds and which we incorporate into a R\'enyi filter for individual privacy accounting. Experimentally, we reach the same utility while saving $50\%$ of the privacy budget $\varepsilon$ for $\varepsilon \le 0.5$ on the Abalone regression dataset (dataset size $\approx 4K$), saving $30\%$ of $\varepsilon$ for $\varepsilon \le 0.08$ on the Adult classification dataset (dataset size $\approx 50K$), and saving $30\%$ of $\varepsilon$ for $\varepsilon \le 0.03$ on the Spambase classification dataset (dataset size $\approx 5K$). Moreover, we show that when a GBDT learns from a stream of data originating from different subpopulations (non-IID), S-BDT achieves even greater savings in $\varepsilon$.
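To illustrate the core mechanism named above, here is a minimal sketch of perturbing released statistics with non-spherical (anisotropic) multivariate Gaussian noise: instead of one uniform noise scale for every coordinate, each coordinate gets its own scale. The function name, the per-coordinate scales, and the example values are purely illustrative assumptions, not the paper's calibration.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_anisotropic_gaussian_noise(stats, sigmas):
    """Add independent Gaussian noise with per-coordinate scales.

    stats  : np.ndarray, the exact (non-private) statistics
    sigmas : np.ndarray, per-coordinate noise standard deviations,
             i.e. a diagonal, non-spherical covariance matrix
    """
    return stats + rng.normal(loc=0.0, scale=sigmas, size=stats.shape)

# Illustrative example: two released statistics noised with different
# scales, e.g. because they have different sensitivities (values are
# hypothetical and not taken from the paper).
exact = np.array([12.5, 40.0])
sigmas = np.array([1.0, 4.0])
noisy = add_anisotropic_gaussian_noise(exact, sigmas)
```

A spherical mechanism would use a single scale `sigma` for both coordinates; allowing the scales to differ is what lets the covariance be tailored to the per-coordinate sensitivities, which is the source of the noise savings the abstract describes.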