In classification problems, datasets are usually imbalanced, noisy, or complex. Most sampling algorithms make only incremental improvements to the linear sampling mechanism of the synthetic minority oversampling technique (SMOTE). Linear oversampling, however, has several unavoidable drawbacks: it is susceptible to overfitting, and its synthetic samples lack diversity and rarely account for the characteristics of the original distribution. This paper proposes an informed nonlinear oversampling framework with granular balls (INGB) as a new direction for oversampling. INGB uses granular balls to simulate the spatial distribution characteristics of a dataset and uses informed entropy to further optimize the granular-ball space. It then performs nonlinear oversampling that follows high-dimensional sparsity and an isotropic Gaussian distribution. INGB is also highly compatible: it can be combined with most SMOTE-based sampling algorithms to improve their performance, and it extends readily to noisy, imbalanced multi-class problems. The mathematical model and theoretical proof of INGB are given in this work. Extensive experiments demonstrate that INGB outperforms traditional linear sampling frameworks and algorithms when oversampling complex datasets.
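The contrast between linear and nonlinear oversampling can be illustrated with a minimal Python sketch. It is not the INGB algorithm: granular-ball construction and the informed-entropy optimization are omitted, and the function names, the rejection step, and the choice of standard deviation are assumptions made only for illustration. The first function places a synthetic point on the segment between a minority sample and a neighbor, as SMOTE does; the second draws points from an isotropic Gaussian centered on a ball and keeps only those that fall inside it.

```python
import numpy as np


def linear_oversample(x, neighbor, rng):
    """SMOTE-style step: return a random point on the segment from x to neighbor."""
    lam = rng.uniform(0.0, 1.0)
    return x + lam * (neighbor - x)


def gaussian_ball_oversample(center, radius, n_samples, rng):
    """Illustrative nonlinear step (hypothetical helper, not from the paper).

    Draws candidates from an isotropic Gaussian centered on `center` and
    keeps only those inside the ball of the given radius.
    """
    dim = center.shape[0]
    # Assumption: scale sigma with 1/sqrt(dim) so the typical distance from
    # the center (about sigma * sqrt(dim)) stays near half the radius.
    sigma = radius / (2.0 * np.sqrt(dim))
    kept = np.empty((0, dim))
    while kept.shape[0] < n_samples:
        cand = rng.normal(loc=center, scale=sigma, size=(n_samples, dim))
        inside = np.linalg.norm(cand - center, axis=1) <= radius
        kept = np.vstack([kept, cand[inside]])
    return kept[:n_samples]
```

Points from `linear_oversample` always lie on a line between two existing samples, whereas points from `gaussian_ball_oversample` spread in every direction around the center, which is one way to picture the added diversity the abstract attributes to nonlinear oversampling.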