Imbalanced data are frequently encountered in real-world classification tasks. Previous work on imbalanced learning has mostly focused on settings where the minority class contains few samples. However, the notion of imbalance also applies to cases where the minority class contains abundant samples, which is common in industrial applications such as fraud detection in financial risk management. In this paper, we take a population-level approach to imbalanced learning by proposing a new formulation called \emph{ultra-imbalanced classification} (UIC). Under UIC, loss functions behave differently even when an infinite amount of training data is available. To understand the intrinsic difficulty of UIC problems, we borrow ideas from information theory and establish a framework for comparing loss functions through the lens of statistical information. Building on this framework, we develop a novel learning objective, termed Tunable Boosting Loss, which is provably resistant to data imbalance under UIC and empirically effective, as verified by extensive experiments on both public and industrial datasets.