Privacy-preserving learning of gradient boosting decision trees (GBDT) has the potential for strong utility-privacy tradeoffs on tabular data, such as census data or medical metadata: classical GBDT learners can extract non-linear patterns from small datasets. The state-of-the-art notion for provable privacy properties is differential privacy, which requires that the impact of any single data point is limited and deniable. We introduce a novel differentially private GBDT learner and use four main techniques to improve the utility-privacy tradeoff. (1) We use an improved noise scaling approach with tighter accounting of the privacy leakage of a decision tree leaf than prior work, resulting in noise that in expectation scales with $O(1/n)$ for $n$ data points. (2) We integrate individual R\'enyi filters into our method to learn from data points that have been underutilized during the iterative training process; this result, potentially of independent interest, yields a natural yet effective insight for learning from streams of non-i.i.d. data. (3) We incorporate random decision tree splits to concentrate the privacy budget on learning leaves. (4) We deploy subsampling for privacy amplification. In our evaluation on the Abalone dataset ($<4k$ training data points), we achieve an $R^2$-score of $0.39$ for $\varepsilon=0.15$, which the closest prior work only achieved for $\varepsilon=10.0$. On the Adult dataset ($50k$ training data points), we achieve a test error of $18.7\,\%$ for $\varepsilon=0.07$, which the closest prior work only achieved for $\varepsilon=1.0$. For $\varepsilon=0.54$, we achieve an $R^2$-score of $0.47$ on Abalone, close to the $0.54$ of the nonprivate version of GBDT, and a test error of $17.1\,\%$ on Adult, close to the $13.7\,\%$ of the nonprivate version of GBDT.
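To make techniques (1) and (4) concrete, the following is a minimal sketch and not the learner described above. It assumes a leaf value computed as a clipped-gradient average with Gaussian noise added to the sum, treats the leaf size $n$ as known (which a real private learner cannot assume for free), and uses the textbook amplification bound for pure $\varepsilon$-DP under Poisson subsampling in place of the R\'enyi accounting. The names `noisy_leaf_value`, `amplified_epsilon`, `clip_bound`, and `sigma` are hypothetical.

```python
import math
import random


def clip(x, bound):
    """Clamp x to [-bound, bound] so one data point changes the sum by at most bound."""
    return max(-bound, min(bound, x))


def noisy_leaf_value(gradients, clip_bound, sigma, rng):
    """Illustrative leaf value: (sum of clipped gradients + Gaussian noise) / n.

    Assumption: n is treated as public here. The noise on the sum has standard
    deviation sigma * clip_bound, so the noise on the average shrinks like 1/n.
    """
    n = len(gradients)
    if n == 0:
        return 0.0
    clipped_sum = sum(clip(g, clip_bound) for g in gradients)
    noise = rng.gauss(0.0, sigma * clip_bound)
    return (clipped_sum + noise) / n


def amplified_epsilon(epsilon, q):
    """Textbook bound for an epsilon-DP mechanism run on a Poisson subsample of rate q."""
    return math.log(1.0 + q * (math.exp(epsilon) - 1.0))


if __name__ == "__main__":
    rng = random.Random(0)
    small_leaf = [0.3] * 10
    large_leaf = [0.3] * 10_000
    # The same noise scale on the sum perturbs the large leaf far less.
    print(noisy_leaf_value(small_leaf, clip_bound=1.0, sigma=2.0, rng=rng))
    print(noisy_leaf_value(large_leaf, clip_bound=1.0, sigma=2.0, rng=rng))
    # Subsampling 10% of the data reduces the effective epsilon of each step.
    print(amplified_epsilon(1.0, 0.1))
```

In this sketch the perturbation of the leaf average has standard deviation `sigma * clip_bound / n`, which is the $O(1/n)$ behavior the abstract refers to, while `amplified_epsilon` shows why training each tree on a random subsample spends less privacy budget per step.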