Imbalanced data poses a significant challenge in classification because model performance suffers when the model learns too little from minority classes. Balancing methods are often used to address this problem, but they can introduce problems of their own, such as overfitting or loss of information. This study addresses a more challenging aspect of balancing methods: their impact on model behavior. To capture these changes, Explainable Artificial Intelligence tools are used to compare models trained on datasets before and after balancing. In addition to variable importance, the study uses partial dependence profiles and accumulated local effects. Both real and simulated datasets are tested, and an open-source Python package, edgaro, is developed to facilitate this analysis. The results show that balancing methods significantly change model behavior, which can yield models biased toward a balanced distribution. These findings confirm that the analysis of balancing should go beyond comparisons of model performance if machine learning models are to be more reliable. We therefore propose a new method, the performance gain plot, to support an informed data balancing strategy: it guides the selection of a balancing method by weighing the measured change in model behavior against the gain in performance.
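The before-and-after comparison described in the abstract can be illustrated with a minimal sketch. It does not use the edgaro package or its API. It assumes scikit-learn and NumPy, a simulated dataset, random oversampling as the balancing method, and a random forest as the model. The helper names `random_oversample` and `partial_dependence_profile` are hypothetical, and the mean absolute difference between profiles is an illustrative stand-in for a measure of change in model behavior, not necessarily the one used in the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def random_oversample(X, y, rng):
    """Duplicate random minority-class rows until both classes have equal size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    extra = rng.choice(np.flatnonzero(y == minority), size=n_extra, replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])


def partial_dependence_profile(model, X, feature, grid):
    """Mean predicted probability of class 1 with `feature` fixed at each grid value."""
    profile = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value
        profile.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(profile)


rng = np.random.default_rng(0)

# Simulated imbalanced data (assumption: roughly 95% / 5% class split).
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)
X_bal, y_bal = random_oversample(X, y, rng)

# Same model configuration, trained before and after balancing.
model_orig = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
model_bal = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bal, y_bal)

# Evaluate both profiles on the same grid and the same reference data (original X).
feature = 0
grid = np.quantile(X[:, feature], np.linspace(0.05, 0.95, 20))
pdp_orig = partial_dependence_profile(model_orig, X, feature, grid)
pdp_bal = partial_dependence_profile(model_bal, X, feature, grid)

# Illustrative measure of behavior change: mean absolute gap between the profiles.
profile_shift = np.mean(np.abs(pdp_orig - pdp_bal))
print(f"Mean absolute PDP difference for feature {feature}: {profile_shift:.4f}")
```

Both profiles are computed on the original, unbalanced data so that any gap reflects a change in the model rather than in the reference distribution. Plotting such a gap against the change in a performance metric for several balancing methods would give a picture in the spirit of the proposed performance gain plot.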