Imbalanced data affects a wide range of machine learning applications, from healthcare to network security. As SMOTE is one of the most popular approaches to addressing this issue, it is important to validate it not only empirically but also theoretically. In this paper, we provide a rigorous theoretical analysis of SMOTE's convergence properties. Concretely, we prove that the synthetic random variable Z converges in probability to the underlying random variable X. We further prove a stronger convergence in mean when the support of X is compact. Finally, we show that lower values of the nearest-neighbor rank lead to faster convergence, offering actionable guidance to practitioners. The theoretical results are supported by numerical experiments on both real-world and synthetic data. Our work provides a foundational understanding that benefits data augmentation techniques beyond imbalanced-data scenarios.
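The synthetic variable Z analyzed above is generated by SMOTE-style interpolation: pick a data point, pick one of its k nearest neighbors, and return a point uniformly at random on the segment between them. Below is a minimal NumPy sketch of this generation step; the function name `smote_sample` and its parameters are illustrative, not the paper's notation or any library's API.

```python
import numpy as np

def smote_sample(X, k=5, n_synthetic=100, rng=None):
    """Draw SMOTE-style synthetic points from data matrix X (n, d).

    For each synthetic point: pick a random base sample x, pick one of
    its k nearest neighbors x_nn, and return x + u * (x_nn - x) with
    u ~ Uniform(0, 1). This is the synthetic random variable Z whose
    convergence to X the theory concerns.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    # Brute-force pairwise distances; a KD-tree is preferable for large n.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Indices of the k nearest neighbors, excluding the point itself.
    nn = np.argsort(d, axis=1)[:, 1:k + 1]
    base = rng.integers(0, n, size=n_synthetic)
    # For each base point, choose one of its k neighbors at random;
    # smaller neighbor ranks correspond to faster convergence.
    chosen = nn[base, rng.integers(0, k, size=n_synthetic)]
    u = rng.random((n_synthetic, 1))
    return X[base] + u * (X[chosen] - X[base])
```

Because each synthetic point is a convex combination of two observed points, every coordinate of Z lies between the coordinate-wise minimum and maximum of the data, and as the sample grows the nearest-neighbor distances shrink, which is the mechanism behind the convergence results stated above.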