This study examines the impact of class-imbalanced data on deep learning models and proposes a data-balancing technique that generates synthetic data for the minority class. Unlike random oversampling, our method prioritizes balancing the informative regions by identifying high-entropy samples. Well-placed synthetic data can enhance the accuracy and efficiency of machine learning algorithms, whereas poorly placed synthetic data may lead to higher misclassification rates. We introduce an algorithm that maximizes the probability of generating a synthetic sample in the correct region of its class by optimizing the class posterior ratio. Additionally, to maintain data topology, synthetic data are generated within each minority sample's neighborhood. Our experimental results on forty-one datasets demonstrate the superior performance of our technique in enhancing deep learning models.
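The three ideas named in the abstract (entropy-based selection, neighborhood-constrained generation, and preference for a high class posterior ratio) can be illustrated with a minimal sketch. This is not the algorithm of the paper: the k-NN posterior estimate, the candidate-sampling search used in place of a real optimizer, the nearest-minority-neighbor radius, and every name below (`knn_minority_posterior`, `binary_entropy`, `oversample`, `n_candidates`) are assumptions made purely for illustration.

```python
import numpy as np


def knn_minority_posterior(points, X, y, minority_label, k):
    """Crude k-NN estimate of P(minority | x) for each row of `points` (assumed estimator)."""
    # Pairwise squared Euclidean distances, shape (len(points), len(X)).
    d2 = ((points[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    return (y[nearest] == minority_label).mean(axis=1)


def binary_entropy(p, eps=1e-12):
    """Entropy of a Bernoulli(p) variable; largest where the two classes overlap."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))


def oversample(X, y, minority_label, n_new, k=5, n_candidates=20, seed=0):
    """Generate `n_new` synthetic minority samples (illustrative sketch only)."""
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]

    # 1. Favor informative seeds: sampling weight grows with local class entropy.
    p_seed = knn_minority_posterior(X_min, X, y, minority_label, k)
    weights = binary_entropy(p_seed) + 1e-6
    weights /= weights.sum()

    # 2. Neighborhood radius: distance to the nearest other minority sample
    #    (requires at least two minority samples).
    d_min = np.sqrt(((X_min[:, None, :] - X_min[None, :, :]) ** 2).sum(axis=2))
    np.fill_diagonal(d_min, np.inf)
    radius = d_min.min(axis=1)

    synthetic = np.empty((n_new, X.shape[1]))
    for i in range(n_new):
        j = rng.choice(len(X_min), p=weights)
        # Draw random candidates inside the ball around seed j.
        directions = rng.normal(size=(n_candidates, X.shape[1]))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        lengths = rng.uniform(0.0, radius[j], size=(n_candidates, 1))
        candidates = X_min[j] + directions * lengths
        # 3. Keep the candidate with the highest estimated posterior ratio
        #    P(minority | x) / P(majority | x); a stand-in for true optimization.
        p = knn_minority_posterior(candidates, X, y, minority_label, k)
        ratio = (p + 1e-6) / (1.0 - p + 1e-6)
        synthetic[i] = candidates[np.argmax(ratio)]
    return synthetic
```

Calling `oversample(X, y, minority_label=1, n_new=40)` on a feature matrix `X` and label vector `y` returns a `(40, n_features)` array whose rows each lie within the neighborhood radius of some original minority sample. Picking the best of a few random candidates is only a rough proxy for optimizing the posterior ratio, chosen here to keep the sketch short.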