Machine learning is becoming ubiquitous. From finance to medicine, machine learning models are improving decision-making processes and even outperforming humans in some tasks. This progress in prediction quality has not, however, been matched by progress in the security of such models and their predictions: perturbing fractions of the training set (poisoning) can seriously undermine model accuracy. Research on poisoning attacks and defenses has received increasing attention over the last decade, leading to several promising solutions that aim to increase the robustness of machine learning. Among them, ensemble-based defenses, in which different models are trained on portions of the training set and their predictions are then aggregated, provide strong theoretical guarantees at the price of a linear overhead. Surprisingly, ensemble-based defenses, which place no restrictions on the base model, have not been applied to increase the robustness of random forest models. This paper aims to fill this gap by designing and implementing a novel hash-based ensemble approach that protects random forests against untargeted, random poisoning attacks. An extensive experimental evaluation measures the performance of our approach against a variety of attacks, as well as its sustainability in terms of resource consumption and performance, and compares it with a traditional monolithic model based on random forest. A final discussion presents our main findings and compares our approach with existing poisoning defenses targeting random forests.
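The general idea of an ensemble-based defense (train one model per portion of the training set, then aggregate predictions) can be illustrated with a minimal sketch. The abstract does not say how hashes assign samples to portions, so the content hash used here (SHA-256 over the sample's features and label) is an assumption, as are all function names. The base learner is a placeholder 1-nearest-neighbor model chosen to keep the sketch dependency-free; in the setting described above it would be replaced by a random forest trained on each partition.

```python
import hashlib
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple

Sample = Tuple[Tuple[float, ...], Hashable]   # (features, label)
Model = Callable[[Sequence[float]], Hashable]  # maps features to a label


def partition_index(features: Sequence[float], label: Hashable, n_partitions: int) -> int:
    """Assign a sample to a partition from a deterministic hash of its content.

    Assumption: hashing the (features, label) pair; the paper may hash differently.
    """
    digest = hashlib.sha256(repr((tuple(features), label)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions


def split_by_hash(samples: Sequence[Sample], n_partitions: int) -> List[List[Sample]]:
    """Split the training set into disjoint partitions, one per hash bucket."""
    partitions: List[List[Sample]] = [[] for _ in range(n_partitions)]
    for features, label in samples:
        partitions[partition_index(features, label, n_partitions)].append((features, label))
    return partitions


def train_nearest_neighbor(partition: Sequence[Sample]) -> Model:
    """Placeholder base learner (1-NN); stands in for a random forest."""
    def predict(x: Sequence[float]) -> Hashable:
        nearest = min(partition, key=lambda s: sum((a - b) ** 2 for a, b in zip(s[0], x)))
        return nearest[1]
    return predict


def train_ensemble(samples: Sequence[Sample], n_partitions: int,
                   train_fn: Callable[[Sequence[Sample]], Model]) -> List[Model]:
    """Train one base model per non-empty partition."""
    return [train_fn(part) for part in split_by_hash(samples, n_partitions) if part]


def predict_majority(models: Sequence[Model], features: Sequence[float]) -> Hashable:
    """Aggregate the base models' predictions by plurality vote."""
    votes = Counter(model(features) for model in models)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    data = [((0.0, 0.1), "benign"), ((0.2, 0.0), "benign"), ((0.1, 0.3), "benign"),
            ((5.0, 5.1), "malicious"), ((5.2, 4.9), "malicious"), ((4.8, 5.3), "malicious")]
    ensemble = train_ensemble(data, n_partitions=3, train_fn=train_nearest_neighbor)
    print(predict_majority(ensemble, (5.0, 5.0)))
```

In this sketch, a sample inserted into the training set lands in exactly one partition, so it can influence at most one base model's vote. This is the intuition behind the robustness of partition-based ensembles, and training one model per partition is where the linear overhead comes from.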