Imbalanced classification presents a formidable challenge in machine learning, particularly when tabular datasets are plagued by noise and overlapping class boundaries. From a geometric perspective, the core difficulty lies in the topological intrusion of the majority class into the minority manifold, which obscures the true decision boundary. Traditional undersampling techniques, such as Edited Nearest Neighbours (ENN), typically employ symmetric cleaning rules and uniform voting, failing to capture the local manifold structure and often inadvertently removing informative minority samples. In this paper, we propose GMR (Geometric Manifold Rectification), a novel framework designed to robustly handle imbalanced structured data by exploiting local geometric priors. GMR makes two contributions: (1) Geometric confidence estimation that uses inverse-distance weighted kNN voting with an adaptive distance metric to capture local reliability; and (2) asymmetric cleaning that is strict on majority samples while conservatively protecting minority samples via a safe-guarding cap on minority removal. Extensive experiments on multiple benchmark datasets show that GMR is competitive with strong sampling baselines.
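The two components described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, thresholds, and the use of a plain Euclidean metric (in place of the paper's adaptive distance metric) are all assumptions for illustration only.

```python
import numpy as np

def knn_confidence(X, y, k=5, eps=1e-8):
    """Inverse-distance weighted kNN voting: a sample's confidence is the
    weighted fraction of its k nearest neighbours sharing its label.
    (A fixed Euclidean metric stands in for the paper's adaptive metric.)"""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # exclude the sample itself
    conf = np.empty(n)
    for i in range(n):
        nb = np.argsort(d[i])[:k]        # k nearest neighbours of sample i
        w = 1.0 / (d[i, nb] + eps)       # inverse-distance weights
        conf[i] = w[y[nb] == y[i]].sum() / w.sum()
    return conf

def asymmetric_clean(X, y, minority_label, k=5,
                     maj_thresh=0.5, min_thresh=0.2, minority_cap=0.1):
    """Asymmetric cleaning: strict on majority samples (drop if confidence
    falls below maj_thresh), conservative on minority samples (a lower
    threshold plus a hard cap on the fraction of minority removals).
    Threshold and cap values here are illustrative assumptions."""
    conf = knn_confidence(X, y, k)
    keep = np.ones(len(X), dtype=bool)
    is_min = (y == minority_label)

    # Majority class: strict rule, remove all low-confidence samples.
    keep[~is_min & (conf < maj_thresh)] = False

    # Minority class: remove only the lowest-confidence candidates,
    # never more than minority_cap of the minority class.
    cand = np.where(is_min & (conf < min_thresh))[0]
    budget = int(minority_cap * is_min.sum())
    cand = cand[np.argsort(conf[cand])][:budget]
    keep[cand] = False
    return X[keep], y[keep]
```

A typical call would pass the feature matrix, labels, and minority label, then train any downstream classifier on the returned cleaned set; the cap guarantees the minority class cannot be eroded below a fixed fraction of its original size regardless of how noisy the overlap region is.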