This paper evaluates XGBoost's performance across different dataset sizes and class distributions, from perfectly balanced to highly imbalanced. XGBoost was selected for evaluation because it stands out in several benchmarks for its detection performance and speed. After introducing the problem of fraud detection, the paper reviews evaluation metrics for detection systems, that is, binary classifiers, and illustrates with examples how different metrics behave on balanced and imbalanced datasets. It then examines the principles of XGBoost, proposes a pipeline for data preparation, and compares a vanilla XGBoost against a random search-tuned XGBoost. Random search fine-tuning provides consistent improvement for large datasets of 100 thousand samples, but not for medium and small datasets of 10 thousand and 1 thousand samples, respectively. As expected, XGBoost's recognition performance improves as more data becomes available, and its detection performance deteriorates as the datasets become more imbalanced. Tests on distributions with 50, 45, 25, and 5 percent positive samples show that the largest drop in detection performance occurs for the distribution with only 5 percent positive samples. Sampling to balance the training set does not provide consistent improvement. Therefore, future work will include a systematic study of different techniques for dealing with data imbalance, as well as an evaluation of other approaches, including graphs, autoencoders, and generative adversarial methods, to deal with the lack of labels.
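The point about metrics behaving differently on balanced and imbalanced data can be illustrated with a minimal sketch. It is not the paper's pipeline: the helper names (`binary_metrics`, `make_labels`, `always_negative`) and the trivial "never alert" detector are assumptions introduced here for illustration, and only the positive-class fractions (50, 45, 25, and 5 percent) and the 1 thousand sample size are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    accuracy: float
    precision: float
    recall: float
    f1: float


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Define the ratio as 0.0 when its denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return Metrics(accuracy, precision, recall, f1)


def make_labels(n, positive_fraction):
    """Hypothetical helper: n labels with the given share of positives."""
    n_pos = round(n * positive_fraction)
    return [1] * n_pos + [0] * (n - n_pos)


def always_negative(y_true):
    """Illustrative degenerate detector that never flags fraud."""
    return [0] * len(y_true)


# Class distributions named in the abstract, on a small 1,000-sample set.
results = {}
for fraction in (0.50, 0.45, 0.25, 0.05):
    labels = make_labels(1000, fraction)
    results[fraction] = binary_metrics(labels, always_negative(labels))
    m = results[fraction]
    print(f"positives={fraction:.0%} accuracy={m.accuracy:.2f} "
          f"recall={m.recall:.2f} f1={m.f1:.2f}")
```

With only 5 percent positive samples, this detector reaches 0.95 accuracy while its recall and F1 are both zero. Accuracy alone can therefore look strong on a highly imbalanced fraud dataset even when no fraud is detected.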