Synthetic Augmentation in Imbalanced Learning: When It Helps, When It Hurts, and How Much to Add

Imbalanced classification, where one class is observed far less frequently than the other, often causes standard training procedures to prioritize the majority class and perform poorly on rare but important cases. A classic and widely used remedy is to augment the minority class with synthetic examples, but two basic questions remain under-resolved: when does synthetic augmentation actually help, and how many synthetic samples should be generated? We develop a unified statistical framework for synthetic augmentation in imbalanced learning, studying models trained on imbalanced data augmented with synthetic minority samples and evaluated under the balanced population risk. Our theory shows that synthetic data is not always beneficial. In a "local symmetry" regime, imbalance is not the dominant source of error near the balanced optimum, so adding synthetic samples cannot improve learning rates and can even degrade performance by amplifying generator mismatch. When augmentation can help (a "local asymmetry" regime), the optimal synthetic size depends on generator accuracy and on whether the generator's residual mismatch is directionally aligned with the intrinsic majority-minority shift. This structure can make the best synthetic size deviate from naive full balancing, sometimes by a small refinement and sometimes substantially when generator bias is systematic. Practically, we recommend Validation-Tuned Synthetic Size (VTSS): select the synthetic size by minimizing balanced validation loss over a range centered near the fully balanced baseline, while allowing meaningful departures when the data indicate them. Simulations and a real sepsis prediction study support the theory and illustrate when synthetic augmentation helps, when it cannot, and how to tune its quantity effectively.
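The VTSS recommendation can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-dimensional toy data, the Gaussian fitted to the minority class as a stand-in generator, the plain logistic regression trained by gradient descent, the candidate grid of multiples of the fully balancing size, and all function names (`select_synthetic_size`, `balanced_log_loss`, and so on). The sketch only shows the shape of the procedure: train once per candidate synthetic size, score each model by a class-balanced validation loss, and keep the size with the smallest loss.

```python
import math
import random
import statistics


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, steps=300, lr=0.5):
    """Fit a 1-D logistic regression (slope w, intercept b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def balanced_log_loss(model, xs, ys):
    """Average of the per-class mean log-losses, so each class has weight 1/2."""
    w, b = model
    per_class = {0: [], 1: []}
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), 1e-12), 1 - 1e-12)
        per_class[y].append(-math.log(p) if y == 1 else -math.log(1 - p))
    return 0.5 * (statistics.fmean(per_class[0]) + statistics.fmean(per_class[1]))


def select_synthetic_size(train_x, train_y, val_x, val_y, ratios, rng):
    """Return (best_size, losses) over candidate sizes ratio * (n_major - n_minor)."""
    minority = [x for x, y in zip(train_x, train_y) if y == 1]
    gap = len(train_x) - 2 * len(minority)  # synthetic size that fully balances
    mu, sd = statistics.fmean(minority), statistics.stdev(minority)
    losses = {}
    for r in ratios:
        m = int(round(r * gap))
        # Stand-in generator (assumption): Gaussian fitted to real minority data.
        synth = [rng.gauss(mu, sd) for _ in range(m)]
        model = fit_logistic(train_x + synth, train_y + [1] * m)
        # Validation uses real data only; the loss reweights classes equally.
        losses[m] = balanced_log_loss(model, val_x, val_y)
    best = min(losses, key=losses.get)
    return best, losses


def make_data(n_major, n_minor, rng):
    # Toy data: majority ~ N(0, 1) with label 0, minority ~ N(1.5, 1) with label 1.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    xs += [rng.gauss(1.5, 1.0) for _ in range(n_minor)]
    ys = [0] * n_major + [1] * n_minor
    return xs, ys


rng = random.Random(0)
train_x, train_y = make_data(200, 20, rng)
val_x, val_y = make_data(100, 10, rng)
ratios = [0.0, 0.5, 0.75, 1.0, 1.25, 1.5]  # 1.0 is the fully balanced baseline
best_m, losses = select_synthetic_size(train_x, train_y, val_x, val_y, ratios, rng)
print("selected synthetic size:", best_m)
```

The grid is centered on the ratio 1.0 (full balancing) but extends on both sides, which mirrors the idea of allowing departures from the balanced baseline when the validation loss favors them. In practice the grid width, the generator, and the model would all be chosen for the problem at hand.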
