In this paper, we study how class imbalance, typical of low-default credit portfolios, affects the performance of logistic regression models. Using a simulation study with controlled data-generating mechanisms, we vary (i) the level of class imbalance and (ii) the strength of association between the predictors and the response. The results show that, for a given strength of association, achievable classification accuracy deteriorates markedly as the event rate decreases, and the optimal classification cut-off shifts with the level of imbalance. In contrast, the Gini coefficient is comparatively stable with respect to class imbalance once sample sizes are sufficiently large, even when classification accuracy is strongly affected. As a practical guideline, we summarise attainable classification performance as a function of the event rate and strength of association between the predictors and the response.
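To make the simulation design concrete, the following Python sketch shows one way such a study could be set up. It is an illustration under our own assumptions, not the data-generating mechanism actually used in the paper. We assume a single standard-normal predictor, an intercept that controls the event rate, and a slope that controls the strength of association. We also assume Youden's J (sensitivity + specificity - 1) as the criterion for choosing the cut-off. All function names (`simulate`, `fit_logistic`, `gini`, `youden_cutoff`, `run_grid`) and the grid values are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def simulate(n, intercept, slope, rng):
    """Draw x ~ N(0, 1) and y ~ Bernoulli(sigmoid(intercept + slope * x))."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [1 if rng.random() < sigmoid(intercept + slope * x) else 0 for x in xs]
    return xs, ys


def fit_logistic(xs, ys, iters=25, tol=1e-8):
    """Newton-Raphson fit of a one-predictor logistic regression; returns (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # score (gradient) components
            g1 += (y - p) * x
            h00 += w               # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det <= 1e-12:           # degenerate information matrix, stop
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def gini(scores, ys):
    """Gini = 2 * AUC - 1, with AUC from the rank-sum statistic (average ranks for ties)."""
    n1 = sum(ys)
    n0 = len(ys) - n1
    if n1 == 0 or n0 == 0:
        return float("nan")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    r1 = sum(r for r, y in zip(ranks, ys) if y == 1)
    auc = (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)
    return 2 * auc - 1


def youden_cutoff(probs, ys):
    """Cut-off (classify as event if prob >= cut-off) maximising Youden's J.

    Assumes both classes are present and ignores ties between scores.
    """
    n1 = sum(ys)
    n0 = len(ys) - n1
    tp = fp = 0
    best_j, best_cut = -1.0, 1.0
    for p, y in sorted(zip(probs, ys), reverse=True):
        tp += y
        fp += 1 - y
        j = tp / n1 - fp / n0
        if j > best_j:
            best_j, best_cut = j, p
    return best_cut, best_j


def run_grid(n=5000, seed=0):
    """One replication per cell of an illustrative (intercept, slope) grid."""
    rng = random.Random(seed)
    rows = []
    for intercept in (-1.0, -3.0, -5.0):   # lower intercept -> lower event rate
        for slope in (0.5, 1.5):           # larger slope -> stronger association
            xs, ys = simulate(n, intercept, slope, rng)
            b0, b1 = fit_logistic(xs, ys)
            probs = [sigmoid(b0 + b1 * x) for x in xs]
            cut, j = youden_cutoff(probs, ys)
            rows.append({"intercept": intercept, "slope": slope,
                         "event_rate": sum(ys) / n, "gini": gini(probs, ys),
                         "cutoff": cut, "youden_j": j})
    return rows


if __name__ == "__main__":
    for row in run_grid():
        print(row)
```

A single replication per cell is noisy, particularly at low event rates where only a few dozen events are observed. A fuller study would repeat each cell many times and evaluate on held-out data rather than on the training sample.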