We introduce an optimization-based reconstruction attack that can completely or near-completely reconstruct the dataset used to train a random forest. Notably, our approach relies solely on information readily available in commonly used libraries such as scikit-learn. To achieve this, we formulate the reconstruction problem as a combinatorial problem under a maximum likelihood objective. We show that this problem is NP-hard, yet solvable at scale using constraint programming, an approach rooted in constraint propagation and solution-domain reduction. Through an extensive computational investigation, we demonstrate that random forests trained without bootstrap aggregation but with feature randomization are susceptible to complete reconstruction, even with a small number of trees. Even with bootstrap aggregation, the majority of the data can still be reconstructed. These findings underscore a critical vulnerability inherent in widely adopted ensemble methods, one that warrants attention and mitigation. Although the potential for such reconstruction attacks has been discussed in privacy research, our study provides clear empirical evidence of their practicability.
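To give a rough intuition for why tree structure constrains the training data, the following toy sketch enumerates every binary dataset whose per-leaf sample counts match those published by a small forest. It is only an illustration under strong assumptions: the tuple encoding of trees, the function names, and the brute-force enumeration are all hypothetical and stand in for the constraint-programming formulation with a maximum likelihood objective described above. The per-leaf counts play the role of the node statistics a fitted scikit-learn tree exposes (for instance via `tree_.n_node_samples`).

```python
from itertools import combinations_with_replacement, product

# Hypothetical toy encoding (not scikit-learn's internal format):
# a node is ("leaf", n_samples) or ("split", feature_index, left, right),
# where samples with feature value 0 go left and value 1 go right.

def published_counts(tree):
    """Per-leaf sample counts stored in the tree, read left to right."""
    if tree[0] == "leaf":
        return [tree[1]]
    return published_counts(tree[2]) + published_counts(tree[3])

def leaf_counts(tree, dataset):
    """Per-leaf sample counts obtained by routing `dataset` through the tree."""
    if tree[0] == "leaf":
        return [len(dataset)]
    _, feat, left, right = tree
    go_left = [x for x in dataset if x[feat] == 0]
    go_right = [x for x in dataset if x[feat] == 1]
    return leaf_counts(left, go_left) + leaf_counts(right, go_right)

def consistent_datasets(forest, n_samples, n_features):
    """Brute-force all multisets of binary points matching every tree's counts."""
    points = list(product([0, 1], repeat=n_features))
    for candidate in combinations_with_replacement(points, n_samples):
        if all(leaf_counts(t, candidate) == published_counts(t) for t in forest):
            yield candidate

# Assumed ground truth for this toy: {(0, 0), (0, 1), (1, 1)}.
stump_f0 = ("split", 0, ("leaf", 2), ("leaf", 1))
stump_f1 = ("split", 1, ("leaf", 1), ("leaf", 2))
deeper = ("split", 0, ("split", 1, ("leaf", 1), ("leaf", 1)), ("leaf", 1))

ambiguous = list(consistent_datasets([stump_f0, stump_f1], 3, 2))
unique = list(consistent_datasets([stump_f1, deeper], 3, 2))
print(len(ambiguous), unique)
```

In this toy, the two stumps leave two candidate datasets, whereas combining one stump with the deeper tree leaves only the original one. Enumeration of this kind grows exponentially with the number of samples and features, which is why a solver based on constraint propagation is the natural tool at realistic scale.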