Imbalanced classification often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic samples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help ("local asymmetry"), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing the balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures. Extensive simulations and real-data analyses further support our findings.
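The VTSS selection rule described above can be sketched as a simple grid search. The sketch below is illustrative only: `train_fn` (trains a model with `m` synthetic minority samples added) and `balanced_val_loss` (evaluates a trained model on a held-out validation set) are hypothetical callables, and the candidate grid is an assumed choice centered near the fully balanced size `n_maj - n_min` while permitting departures from it.

```python
import numpy as np

def vtss(train_fn, balanced_val_loss, n_maj, n_min, grid=None):
    """Validation-Tuned Synthetic Size (VTSS), illustrative sketch.

    Evaluates candidate synthetic minority sizes m and returns the one
    minimizing the balanced validation loss.  `train_fn(m)` and
    `balanced_val_loss(model)` are user-supplied (hypothetical here).
    """
    full_balance = n_maj - n_min  # synthetic size that fully balances classes
    if grid is None:
        # assumed default: a grid centered near full balancing,
        # wide enough to allow meaningful departures from it
        grid = np.linspace(0, 2 * full_balance, 9).astype(int)
    losses = {int(m): balanced_val_loss(train_fn(m)) for m in grid}
    return min(losses, key=losses.get)
```

For instance, if the balanced validation loss happened to be minimized exactly at full balancing, `vtss` would return `n_maj - n_min`; in general it may select a smaller or larger size, consistent with the theory.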