Random Forests are widely recognized for their efficacy in classification and regression tasks, standing out in domains such as medical diagnosis, finance, and personalized recommendations. These domains, however, are inherently sensitive to privacy concerns, since they involve personal and confidential data. With the growing demand for the right to be forgotten, particularly under regulations such as the GDPR and CCPA, the ability to perform machine unlearning has become crucial for Random Forests. However, this topic has received insufficient attention, and existing approaches are difficult to apply in real-world scenarios. Addressing this gap, we propose the DynFrs framework, designed to enable efficient machine unlearning in Random Forests while preserving predictive accuracy. DynFrs combines the subsampling method Occ(q) with a lazy tag strategy Lzy, and remains adaptable to any Random Forest variant. In essence, Occ(q) ensures that each sample in the training set occurs in only a fraction of the trees, so that the impact of deleting a sample is limited, while Lzy delays the reconstruction of a tree node until necessary, thereby avoiding unneeded modifications to tree structures. In experiments, applying DynFrs to Extremely Randomized Trees yields substantial improvements, achieving orders-of-magnitude faster unlearning and better predictive accuracy than existing machine unlearning methods for Random Forests.
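The core idea behind Occ(q) can be illustrated with a minimal sketch. This is a hypothetical illustration, not the paper's implementation: each training sample is assigned to only a q fraction of the trees, so unlearning a sample needs to touch only those trees. The function name `occ_assign` and its parameters are assumptions for illustration.

```python
import random

def occ_assign(n_samples, n_trees, q, seed=0):
    """Hypothetical sketch of Occ(q)-style subsampling: each training
    sample is placed in only round(q * n_trees) trees, so deleting it
    later affects just that fraction of the forest."""
    rng = random.Random(seed)
    k = max(1, round(q * n_trees))
    # tree_samples[t] holds the indices of samples that tree t trains on
    tree_samples = [set() for _ in range(n_trees)]
    for i in range(n_samples):
        for t in rng.sample(range(n_trees), k):
            tree_samples[t].add(i)
    return tree_samples

forest = occ_assign(n_samples=1000, n_trees=100, q=0.1)
# Unlearning sample 42 only requires revisiting the trees that contain it,
# i.e. roughly a q fraction of the forest rather than every tree:
affected = [t for t, samples in enumerate(forest) if 42 in samples]
```

Under this sketch, a deletion request walks only the `affected` trees; the Lzy strategy would then further defer any node rebuilding within those trees until a query actually reaches a stale node.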