We introduce an optimization-based reconstruction attack capable of completely or near-completely reconstructing the dataset used to train a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. We formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We show that this problem is NP-hard, though solvable at scale using constraint programming, an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to complete reconstruction, even with a small number of trees. Even with bootstrap aggregation, the majority of the data can still be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, warranting attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.
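To make the combinatorial view concrete, the following is a minimal sketch under strong simplifying assumptions, not the formulation used in this work: features are binary, there is no bootstrap aggregation (every tree sees the full training set), and the attacker knows each tree's structure and its per-leaf, per-class example counts. It uses plain brute-force enumeration in place of constraint programming and does not model the maximum likelihood objective. The tree encoding and the names `leaf_of`, `leaf_counts`, and `reconstruct` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations_with_replacement, product

# Hypothetical tree encoding: ("leaf", leaf_id) or ("split", feature, left, right).
# The left branch is taken when the binary feature is 0, the right branch when it is 1.
def leaf_of(tree, x):
    while tree[0] == "split":
        _, feature, left, right = tree
        tree = right if x[feature] == 1 else left
    return tree[1]

def leaf_counts(tree, dataset):
    """Count training examples per (leaf, class): the statistics assumed visible to the attacker."""
    return Counter((leaf_of(tree, x), y) for x, y in dataset)

def reconstruct(forest, observed, n_features, n_samples, classes=(0, 1)):
    """Enumerate every candidate dataset (as a sorted multiset) consistent with all trees."""
    universe = [(x, y) for x in product((0, 1), repeat=n_features) for y in classes]
    return [
        candidate
        for candidate in combinations_with_replacement(universe, n_samples)
        if all(leaf_counts(t, candidate) == c for t, c in zip(forest, observed))
    ]

# Toy "secret" training set: four examples with three binary features and a binary label.
secret = [((0, 0, 1), 0), ((0, 1, 0), 1), ((1, 0, 0), 1), ((1, 1, 1), 0)]

# Two depth-2 trees over different feature pairs, mimicking feature randomization.
tree_a = ("split", 0,
          ("split", 1, ("leaf", 0), ("leaf", 1)),
          ("split", 1, ("leaf", 2), ("leaf", 3)))
tree_b = ("split", 1,
          ("split", 2, ("leaf", 0), ("leaf", 1)),
          ("split", 2, ("leaf", 2), ("leaf", 3)))

one_tree = reconstruct([tree_a], [leaf_counts(tree_a, secret)], 3, 4)
two_trees = reconstruct(
    [tree_a, tree_b],
    [leaf_counts(tree_a, secret), leaf_counts(tree_b, secret)],
    3, 4,
)
print(len(one_tree), len(two_trees))
```

In this toy instance, one tree leaves the third feature of every example undetermined, so several candidate datasets remain. Adding the second tree, which splits on a different feature pair, leaves only the secret dataset. Enumeration is exponential in the dataset size, so it only illustrates the structure of the problem and does not scale.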