Noise is a fundamental problem in learning theory with a significant impact on the application of Machine Learning (ML) methods, since real-world data tend to be noisy. Moreover, maliciously introduced noise can make ML methods fail critically, as is the case with adversarial attacks. Finding and developing techniques that improve robustness to noise is therefore a fundamental problem in ML. In this paper, we propose a method to mitigate the effect of noise through the use of data abstractions. The goal is to reduce the impact of noise on the model's performance by exploiting the loss of information produced by the abstraction. This information loss, however, comes at a cost: it can also reduce accuracy on clean data. First, we explore several methodologies for creating abstractions from the training dataset, for the specific case of numerical data and binary classification tasks. We then test how these abstractions affect robustness to noise through a series of experiments comparing the noise robustness of an Artificial Neural Network trained on raw data \emph{vs} trained on abstracted data. The results clearly show that using abstractions is a viable approach for developing noise-robust ML methods.
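To make the idea concrete, the following is a minimal sketch of one possible abstraction for numerical features: quantile binning fitted on the training data. This is only an illustrative assumption (the abstract does not specify the abstraction mechanism); the helper names `fit_quantile_bins` and `abstract` are hypothetical. A raw value perturbed by small noise typically falls in the same bin, so the model sees an unchanged input.

```python
import numpy as np

def fit_quantile_bins(train_col, n_bins=4):
    # Compute interior bin edges from the training column's quantiles,
    # so each bin holds roughly the same number of training points.
    qs = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(train_col, qs)

def abstract(col, edges):
    # Map each raw value to the index of its bin: this discretization
    # is the information-losing abstraction applied before training.
    return np.digitize(col, edges)

rng = np.random.default_rng(0)
x_train = rng.normal(size=1000)          # stand-in for one numerical feature
edges = fit_quantile_bins(x_train, n_bins=4)

x = np.array([-2.0, 0.0, 2.0])
x_noisy = x + rng.normal(scale=0.05, size=3)  # small additive noise
print(abstract(x, edges))
print(abstract(x_noisy, edges))
```

Values that sit well inside a bin (here -2.0 and 2.0) keep the same abstracted representation under small perturbations, while values near a bin edge (here 0.0) may flip, which is the accuracy cost the abstract refers to.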