Benchmark of Data Preprocessing Methods for Imbalanced Classification

Severe class imbalance is one of the main conditions that make machine learning in cybersecurity difficult. A variety of dataset preprocessing methods have been introduced over the years. These methods modify the training dataset by oversampling, undersampling, or a combination of both to improve the predictive performance of classifiers trained on it. Although these methods are occasionally used in cybersecurity, a comprehensive, unbiased benchmark comparing their performance over a variety of cybersecurity problems is missing. This paper presents a benchmark of 16 preprocessing methods on six cybersecurity datasets together with 17 public imbalanced datasets from other domains. We test the methods under multiple hyperparameter configurations and use an AutoML system to train classifiers on the preprocessed datasets, which reduces potential bias from specific hyperparameter or classifier choices. Special consideration is also given to evaluating the methods with performance measures that are good proxies for practical performance in real-world cybersecurity systems. The main findings of our study are: 1) Most of the time, a data preprocessing method that improves classification performance exists. 2) The baseline approach of doing nothing outperformed a large portion of the methods in the benchmark. 3) Oversampling methods generally outperform undersampling methods. 4) The most significant performance gains are brought by the standard SMOTE algorithm, while more complicated methods provide mainly incremental improvements, often at the cost of worse computational performance.
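
To make the two families of methods concrete, the following is a minimal illustrative sketch, not the implementation used in the benchmark. It shows SMOTE-style oversampling (interpolating between a minority sample and one of its nearest minority neighbours) and random undersampling of the majority class. The function names, the toy data, and the choice of `k=3` are assumptions made for illustration only; production implementations differ in many details.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Return n_new synthetic points, each interpolated between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration; requires at least 2 minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base_idx = rng.randrange(len(minority))
        base = minority[base_idx]
        # All other minority points, sorted by squared Euclidean distance to base.
        others = [p for j, p in enumerate(minority) if j != base_idx]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Keep a random subset of n_keep majority samples (hypothetical helper)."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Toy two-dimensional data (assumed for illustration): 20 majority, 4 minority points.
majority = [(float(i), float(i % 5)) for i in range(20)]
minority = [(0.0, 10.0), (1.0, 11.0), (2.0, 10.5), (1.5, 12.0)]

# Oversampling: grow the minority class until it matches the majority class.
synthetic_points = smote_like_oversample(minority, len(majority) - len(minority))
oversampled_minority = minority + synthetic_points

# Undersampling: shrink the majority class until it matches the minority class.
undersampled_majority = random_undersample(majority, len(minority))
```

Either resulting training set is balanced, but oversampling keeps all original samples while undersampling discards majority data, which is one plausible reason the two families can behave differently in practice.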
