Random Forests are widely recognized for their efficacy in classification and regression tasks, standing out in domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy, as personal and confidential data are involved. With growing demand for the right to be forgotten, particularly under regulations such as the GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, this topic has received insufficient attention, and existing approaches are difficult to apply in real-world scenarios. Addressing this gap, we propose the DynFrs framework, designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. DynFrs leverages the subsampling method Occ(q) and the lazy tag strategy Lzy, and remains adaptable to any Random Forest variant. In essence, Occ(q) ensures that each training sample occurs in only a fraction of the trees, so the impact of deleting a sample is limited, and Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unnecessary modifications to tree structures. In experiments, applying DynFrs to Extremely Randomized Trees yields substantial improvements, achieving orders-of-magnitude faster unlearning and better predictive accuracy than existing machine unlearning methods for Random Forests.
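The core idea behind Occ(q) can be illustrated with a minimal sketch. The code below is not the authors' implementation; it is a hypothetical illustration assuming only what the abstract states: each training sample is assigned to roughly a q-fraction of the T trees, so unlearning a sample touches only those trees rather than the entire forest (Lzy would then further defer rebuilding the affected nodes until they are actually queried).

```python
import random

def assign_samples(n_samples, n_trees, q, seed=0):
    """Map each sample index to the subset of trees it occurs in.

    Each sample is placed in about a q-fraction of the trees, which
    bounds how many trees a single deletion can affect.
    """
    rng = random.Random(seed)
    k = max(1, round(q * n_trees))  # trees per sample
    return {i: rng.sample(range(n_trees), k) for i in range(n_samples)}

def trees_affected_by_deletion(assignment, sample_idx):
    """Only the trees containing the sample must be touched on unlearning."""
    return assignment[sample_idx]

assignment = assign_samples(n_samples=1000, n_trees=100, q=0.1)
affected = trees_affected_by_deletion(assignment, 42)
print(len(affected))  # 10 of 100 trees, not the full forest
```

With q = 0.1 and 100 trees, deleting any one sample requires revisiting only 10 trees, which is the source of the bounded unlearning cost the abstract describes.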