Machine learning is becoming ubiquitous. From finance to medicine, machine learning models are improving decision-making processes and even outperforming humans in some tasks. This progress in prediction quality has not, however, been matched by comparable progress in the security of such models and their predictions: perturbing a fraction of the training set (poisoning) can seriously undermine model accuracy. Research on poisoning attacks and defenses has received increasing attention in the last decade, leading to several promising solutions that aim to increase the robustness of machine learning. Among them, ensemble-based defenses, in which different models are trained on portions of the training set and their predictions are then aggregated, provide strong theoretical guarantees at the price of a linear overhead. Surprisingly, ensemble-based defenses, which place no restrictions on the base model, have not been applied to increase the robustness of random forest models. This paper aims to fill this gap by designing and implementing a novel hash-based ensemble approach that protects random forests against untargeted, random poisoning attacks. An extensive experimental evaluation measures the performance of our approach against a variety of attacks, as well as its sustainability in terms of resource consumption and performance, and compares it with a traditional monolithic model based on random forest. A final discussion presents our main findings and compares our approach with existing poisoning defenses targeting random forests.
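The general idea of a hash-based ensemble can be illustrated with a minimal sketch. The abstract does not specify the hash function, the partitioning key, or the aggregation rule, so everything below is an assumption made for illustration: SHA-256 over the feature tuple, plain majority voting, and hypothetical names such as `hash_partition` and `train_ensemble`. To stay runnable with only the standard library, the base learner is a stand-in that predicts the most common label of its partition; in the setting described above, each partition would instead train a random forest.

```python
import hashlib
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Features = Tuple[float, ...]
Sample = Tuple[Features, Hashable]      # (features, label)
Model = Callable[[Features], Hashable]  # maps features to a predicted label


def partition_index(features: Features, n_partitions: int) -> int:
    """Deterministically map a sample to a partition (assumed: SHA-256 of its features)."""
    digest = hashlib.sha256(repr(features).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def hash_partition(data: Sequence[Sample], n_partitions: int) -> List[List[Sample]]:
    """Split the training set into disjoint partitions; each sample lands in exactly one."""
    parts: List[List[Sample]] = [[] for _ in range(n_partitions)]
    for features, label in data:
        parts[partition_index(features, n_partitions)].append((features, label))
    return parts


def train_ensemble(
    data: Sequence[Sample],
    n_partitions: int,
    train_fn: Callable[[List[Sample]], Model],
) -> List[Model]:
    """Train one base model per non-empty partition."""
    return [train_fn(part) for part in hash_partition(data, n_partitions) if part]


def predict_majority(models: Sequence[Model], features: Features) -> Hashable:
    """Aggregate base-model predictions by majority vote (ties: first label seen wins)."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


def train_majority_class(part: List[Sample]) -> Model:
    """Stand-in base learner: always predicts the partition's most common label."""
    label = Counter(lbl for _, lbl in part).most_common(1)[0][0]
    return lambda features: label


if __name__ == "__main__":
    training_set = [((float(i), float(2 * i)), i % 2) for i in range(100)]
    ensemble = train_ensemble(training_set, n_partitions=5, train_fn=train_majority_class)
    print(len(ensemble), predict_majority(ensemble, (3.0, 6.0)))
```

Because the assignment is a deterministic function of the sample, each training point contributes to only one base model, and the cost of training grows with the number of partitions. Swapping `train_majority_class` for a function that fits a random forest on the partition would bring the sketch closer to the approach the abstract describes, though the actual design may differ.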