Class imbalance significantly degrades classification performance, yet its effects are rarely analyzed from a unified theoretical perspective. We propose a principled framework based on three fundamental scales: the imbalance coefficient $\eta$, the sample-to-dimension ratio $\kappa$, and the intrinsic separability $\Delta$. Starting from the Gaussian Bayes classifier, we derive closed-form Bayes errors and show how imbalance shifts the discriminant boundary, yielding a deterioration slope that predicts four regimes: Normal, Mild, Extreme, and Catastrophic. Using a balanced high-dimensional genomic dataset, we vary only $\eta$ while keeping $\kappa$ and $\Delta$ fixed. Across parametric and non-parametric models, empirical degradation closely follows the theoretical predictions: minority Recall collapses once $\log(\eta)$ exceeds $\Delta\sqrt{\kappa}$, Precision increases asymmetrically, and F1-score and PR-AUC decline in line with the predicted regimes. These results show that the triplet $(\eta, \kappa, \Delta)$ provides a model-agnostic, geometrically grounded explanation of imbalance-induced deterioration.
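As a concrete illustration of the boundary shift, consider a minimal population-level sketch assuming univariate, unit-variance Gaussian classes, $\mathcal{N}(0,1)$ for the majority and $\mathcal{N}(\Delta,1)$ for the minority, with prior ratio $\eta = \pi_{\mathrm{maj}}/\pi_{\mathrm{min}}$; the $\sqrt{\kappa}$ factor in the collapse criterion arises from the finite-sample, high-dimensional analysis and is not re-derived here. The Bayes rule predicts the minority class iff
\begin{align*}
\log\frac{\pi_{\mathrm{min}}\,\varphi(x-\Delta)}{\pi_{\mathrm{maj}}\,\varphi(x)} > 0
\;\iff\; \Delta x - \tfrac{\Delta^{2}}{2} > \log\eta
\;\iff\; x > \frac{\Delta}{2} + \frac{\log\eta}{\Delta},
\end{align*}
so the balanced threshold $\Delta/2$ is displaced by $\log(\eta)/\Delta$ toward the minority class, and minority Recall becomes $\Phi\!\left(\Delta/2 - \log(\eta)/\Delta\right)$, which decays once $\log\eta$ becomes comparable to $\Delta^{2}/2$.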