S-GBDT: Frugal Differentially Private Gradient Boosting Decision Trees

Privacy-preserving learning of gradient boosting decision trees (GBDT) has the potential for strong utility-privacy tradeoffs on tabular data, such as census data or medical metadata: classical GBDT learners can extract non-linear patterns from small datasets. The state-of-the-art notion for provable privacy properties is differential privacy, which requires that the impact of any single data point be limited and deniable. We introduce a novel differentially private GBDT learner and use four main techniques to improve the utility-privacy tradeoff. (1) We use an improved noise scaling approach with tighter accounting of the privacy leakage of a decision tree leaf than in prior work, so that the expected noise scales as $O(1/n)$ for $n$ data points. (2) We integrate individual R\'enyi filters into our method to learn from data points that have been underutilized during the iterative training process, which (potentially of independent interest) yields a natural yet effective approach to learning from streams of non-i.i.d. data. (3) We incorporate random decision tree splits to concentrate the privacy budget on learning leaves. (4) We deploy subsampling for privacy amplification. Our evaluation shows, for the Abalone dataset ($<4k$ training data points), an $R^2$ score of $0.39$ for $\varepsilon=0.15$, which the closest prior work achieved only for $\varepsilon=10.0$. On the Adult dataset ($50k$ training data points) we achieve a test error of $18.7\,\%$ for $\varepsilon=0.07$, which the closest prior work achieved only for $\varepsilon=1.0$. For the Abalone dataset with $\varepsilon=0.54$ we achieve an $R^2$ score of $0.47$, which is close to the $R^2$ score of $0.54$ for the nonprivate version of GBDT. For the Adult dataset with $\varepsilon=0.54$ we achieve a test error of $17.1\,\%$, which is close to the test error of $13.7\,\%$ for the nonprivate version of GBDT.
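To make techniques (3) and (4) concrete, the following is a minimal sketch of a single noisy decision stump that combines Poisson subsampling, a data-independent random split, and Gaussian noise on clipped gradient sums. It is an illustration under stated assumptions, not the S-GBDT algorithm: the function names, the `clip`, `sigma`, and `rate` parameters, the assumption that features are scaled to $[0, 1]$, and the use of a noisy count as denominator are all hypothetical choices. Privacy accounting, the tighter leaf analysis of (1), and the individual R\'enyi filters of (2) are omitted.

```python
import numpy as np


def poisson_subsample(n, rate, rng):
    """Return indices where each of n points is kept independently with probability `rate`."""
    return np.flatnonzero(rng.random(n) < rate)


def random_split(n_features, rng, low=0.0, high=1.0):
    """Choose a feature and threshold without looking at the data.

    Assumes (hypothetically) that features are scaled to the public range [low, high],
    so the split itself consumes no privacy budget.
    """
    feature = int(rng.integers(n_features))
    threshold = float(rng.uniform(low, high))
    return feature, threshold


def fit_noisy_stump(X, grads, clip, sigma, rate, rng):
    """Fit a depth-1 tree with a random split and noisy leaf values (illustrative only)."""
    idx = poisson_subsample(len(X), rate, rng)
    feature, threshold = random_split(X.shape[1], rng)

    go_left = X[idx, feature] <= threshold
    # Clipping bounds each point's contribution to a leaf sum by `clip`.
    clipped = np.clip(grads[idx], -clip, clip)

    leaves = []
    for mask in (go_left, ~go_left):
        # Adding or removing one point changes the sum by at most `clip`
        # and the count by at most 1, so the noise is scaled accordingly.
        noisy_sum = clipped[mask].sum() + rng.normal(0.0, sigma * clip)
        noisy_count = max(1.0, mask.sum() + rng.normal(0.0, sigma))
        # Negative mean gradient, as in a squared-loss boosting step.
        leaves.append(-noisy_sum / noisy_count)
    return feature, threshold, leaves


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.random((1000, 3))
    grads = rng.normal(size=1000)
    print(fit_noisy_stump(X, grads, clip=1.0, sigma=2.0, rate=0.1, rng=rng))
```

In this sketch the noise added to each leaf sum has a fixed scale, so its effect on the leaf value shrinks as the leaf contains more points, which is the intuition behind noise that decreases with $n$; the bound stated in the abstract rests on the paper's own analysis, not on this example.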
