Class imbalance remains a significant challenge in machine learning, particularly for tabular data classification tasks. While gradient-boosted decision tree (GBDT) models have proven highly effective for such tasks, their performance can degrade on imbalanced datasets. This paper presents the first comprehensive study on adapting class-balanced loss functions to three GBDT algorithms across various tabular classification tasks, including binary, multi-class, and multi-label classification. We conduct extensive experiments on multiple datasets to evaluate the impact of class-balanced losses on different GBDT models, establishing a valuable benchmark. Our results demonstrate the potential of class-balanced loss functions to enhance GBDT performance on imbalanced datasets, offering a robust approach for practitioners facing class imbalance in real-world applications. Additionally, we introduce a Python package that facilitates the integration of class-balanced loss functions into GBDT workflows, making these techniques accessible to a wider audience.
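To illustrate the general idea behind class-balanced losses, the sketch below computes per-class weights from the "effective number of samples" reweighting scheme (Cui et al., 2019), one common way to derive class-balanced loss weights. This is an illustrative example only, not the API of the package described in this paper; the function name and normalization choice are our own.

```python
import numpy as np

def class_balanced_weights(labels, beta=0.999):
    """Per-class weights from the effective number of samples:
    E_c = (1 - beta**n_c) / (1 - beta), weight_c proportional to 1 / E_c.
    Illustrative sketch; the function name and normalization are assumptions."""
    classes, counts = np.unique(labels, return_counts=True)
    effective_num = (1.0 - np.power(beta, counts)) / (1.0 - beta)
    weights = 1.0 / effective_num
    # Normalize so the weights sum to the number of classes.
    weights = weights * len(classes) / weights.sum()
    return dict(zip(classes.tolist(), weights.tolist()))

# Imbalanced toy labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
w = class_balanced_weights(y, beta=0.99)
# The minority class receives the larger weight.
```

In practice, such weights can be mapped to each training example and passed through the `sample_weight` argument that common GBDT libraries accept in their fit methods, or folded directly into a custom loss.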