In supervised learning, imbalanced datasets are common in practice, and they make learning difficult for standard algorithms. Research on imbalanced learning has focused mainly on classification tasks; despite its importance, imbalanced regression has very few solutions. In this paper, we propose GOLIATH, a data augmentation procedure based on kernel density estimates that can be used in both classification and regression. This general approach encompasses two large families of synthetic oversampling techniques: those based on perturbations, such as Gaussian noise, and those based on interpolations, such as SMOTE. It also provides an explicit form of these algorithms and an expression of their conditional densities, in particular for SMOTE. New synthetic data generators are deduced from this framework. We apply GOLIATH to imbalanced regression by combining these generators with a wild-bootstrap resampling technique for the target values. We empirically evaluate GOLIATH in imbalanced regression settings, compare it with existing state-of-the-art techniques, and demonstrate significant improvement.
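To make the two oversampling families named in the abstract concrete, the following is a minimal sketch of a perturbation-based generator (Gaussian noise, which amounts to sampling from a Gaussian kernel density estimate centred on the observations) and an interpolation-based generator (SMOTE-style). It is an illustration only and not the GOLIATH algorithm itself: the function names, the `bandwidth` and `k` parameters, and the toy data are assumptions, and the wild-bootstrap step for the target values is not shown.

```python
import numpy as np


def gaussian_noise_oversample(X, n_new, bandwidth, rng):
    """Perturbation family: pick a seed observation uniformly at random and
    add isotropic Gaussian noise, i.e. sample from a Gaussian KDE on X."""
    seeds = rng.integers(0, len(X), size=n_new)
    noise = rng.normal(0.0, bandwidth, size=(n_new, X.shape[1]))
    return X[seeds] + noise


def smote_oversample(X, n_new, k, rng):
    """Interpolation family: pick a seed, pick one of its k nearest
    neighbours, and draw a point uniformly on the segment between them."""
    # Pairwise Euclidean distances; a point is never its own neighbour.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    seeds = rng.integers(0, len(X), size=n_new)
    picks = neighbours[seeds, rng.integers(0, k, size=n_new)]
    lam = rng.uniform(0.0, 1.0, size=(n_new, 1))
    return X[seeds] + lam * (X[picks] - X[seeds])


# Toy data (hypothetical): 20 observations with 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

X_noise = gaussian_noise_oversample(X, n_new=50, bandwidth=0.1, rng=rng)
X_smote = smote_oversample(X, n_new=50, k=5, rng=rng)
```

Both generators share the same structure: a seed observation is drawn, and a synthetic point is then drawn from a distribution conditional on that seed. Only the conditional distribution differs, which is the sense in which a single kernel-based formulation can cover both families.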