Imbalanced Classification under Capacity Constraints

Detecting observations from a minority class under severe class imbalance is a central challenge in applications such as fraud detection, medical screening, and industrial quality control. In these settings, each positive prediction triggers a costly follow-up action (for example, an MRI scan or a transaction audit) whose execution is subject to real operational constraints. This paper proposes a formal framework for classification under capacity constraints: given a user-defined upper bound $b$ on the proportion of observations that can be labeled as belonging to the minority class, the goal is to find the classifier that maximizes sensitivity on that class. We characterize the optimal classifier under this constraint and establish its equivalence with the classical Bayes classifier under a reweighting of the prior probabilities. We also introduce a capacity-adjusted performance metric $M$ that accounts for the effective detection rate when the capacity constraint is binding. The framework is implemented on top of standard learning methods (k-NN, SVM, random forests, and neural networks), and statistical consistency is established for each. We further show that these methods reduce to post-hoc thresholding when no hyperparameters are oriented toward the capacity-constrained objective, and we introduce a capacity-aware support vector machine that exploits the constraint during training and achieves the strongest empirical performance. Experiments on the Taiwanese credit card default dataset confirm that capacity-constrained classifiers substantially outperform both classical approaches and SMOTE in high-imbalance regimes. The framework extends naturally to multiclass settings and online environments.
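To make the capacity constraint concrete, the following is a minimal sketch of post-hoc thresholding, assuming that some already-fitted model provides a real-valued score per observation (higher meaning more likely minority class). It labels at most a fraction $b$ of the observations as positive and reports the resulting sensitivity. The function names `capacity_threshold_predict` and `sensitivity`, and the toy scores and labels, are hypothetical and chosen for illustration only; this is neither the capacity-aware SVM nor the metric $M$ described above.

```python
import numpy as np


def capacity_threshold_predict(scores, b):
    """Label as positive at most a fraction b of the observations,
    choosing those with the highest scores (hypothetical helper)."""
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    # Maximum number of positive labels allowed by the capacity bound.
    # The small epsilon guards against floating-point error in b * n.
    k = int(np.floor(b * n + 1e-9))
    y_pred = np.zeros(n, dtype=int)
    if k > 0:
        # Indices of the k largest scores (stable sort breaks ties by position).
        top = np.argsort(-scores, kind="stable")[:k]
        y_pred[top] = 1
    return y_pred


def sensitivity(y_true, y_pred):
    """Fraction of true minority-class observations labeled positive."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    positives = y_true == 1
    if positives.sum() == 0:
        return float("nan")
    return float((y_pred[positives] == 1).mean())


# Toy data (made up for illustration): 10 observations, 3 true positives.
example_scores = [0.9, 0.1, 0.8, 0.3, 0.7, 0.2, 0.05, 0.6, 0.4, 0.15]
example_labels = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]

# With b = 0.2, at most 2 of the 10 observations may be labeled positive.
example_pred = capacity_threshold_predict(example_scores, b=0.2)
example_sens = sensitivity(example_labels, example_pred)
```

In this toy example only one of the three true positives falls among the two highest-scoring observations, so sensitivity is capped well below 1 by the capacity bound regardless of how the threshold is chosen.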
