We study the behavior of linear discriminant functions for binary classification in the infinite-imbalance limit, where the sample size of one class grows without bound while the sample size of the other remains fixed. The coefficients of the classifier minimize an empirical loss specified through a weight function. We show that for a broad class of weight functions, the intercept diverges but the rest of the coefficient vector has a finite almost sure limit under infinite imbalance, extending prior work on logistic regression. The limit depends on the left-tail growth rate of the weight function, for which we distinguish two cases: subexponential and exponential. The limiting coefficient vectors reflect robustness or conservatism properties in the sense that they optimize against certain worst-case alternatives. In the subexponential case, the limit is equivalent to an implicit choice of upsampling distribution for the minority class. We apply these ideas in a credit risk setting, with particular emphasis on performance in the high-sensitivity and high-specificity regions.
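The infinite-imbalance behavior described above can be illustrated numerically in the logistic regression case, which the abstract cites as prior work. The following is a minimal sketch under stated assumptions, not the paper's method: it covers only logistic loss rather than general weight functions, the majority class is assumed standard normal, the three minority points are made up for illustration, and `fit_logistic` is a hypothetical helper written for this example. With these choices, the intercept should keep falling as the majority sample grows, while the slope should settle near the minority mean (1.0 here).

```python
import numpy as np

def fit_logistic(x_min, x_maj, iters=50):
    """Fit intercept and slope of a 1-D logistic regression by Newton's method."""
    x = np.concatenate([x_min, x_maj])
    y = np.concatenate([np.ones(len(x_min)), np.zeros(len(x_maj))])
    X = np.column_stack([np.ones_like(x), x])
    # Start from the intercept-only fit so the first Newton steps stay stable.
    theta = np.array([np.log(len(x_min) / len(x_maj)), 0.0])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (y - p)                        # score vector
        hess = (X * (p * (1.0 - p))[:, None]).T @ X  # observed information
        step = np.linalg.solve(hess, grad)
        theta = theta + step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta

# Assumed toy data: a fixed minority sample and a growing majority sample.
rng = np.random.default_rng(0)
x_minority = np.array([0.5, 1.0, 1.5])       # minority mean is 1.0
x_majority_all = rng.standard_normal(100_000)

results = []
for n_major in (1_000, 10_000, 100_000):
    intercept, slope = fit_logistic(x_minority, x_majority_all[:n_major])
    results.append((n_major, intercept, slope))
    print(f"N={n_major:>7}  intercept={intercept:8.3f}  slope={slope:6.3f}")
```

Comparing the printed rows shows the two effects separately: the intercept absorbs the growing class imbalance, and the slope carries the information that survives in the limit.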