Severe class imbalance is common in real-world tabular learning, where the rare minority classes are often the ones that matter most for reliable prediction. Existing generative oversampling methods such as GANs, VAEs, and diffusion models can improve minority-class performance, but they often struggle with tabular heterogeneity, training stability, and privacy concerns. We propose a family of latent-space, tree-driven diffusion methods for minority oversampling that use conditional flow matching with gradient-boosted trees (GBTs) as the vector-field learner. The models operate in compact latent spaces to preserve tabular structure and reduce computation. We introduce three variants: PCAForest, which uses a linear PCA embedding; EmbedForest, which uses a learned nonlinear embedding; and AttentionForest, which uses an attention-augmented embedding. Each method couples a GBT-based flow with a decoder that maps back to the original feature space. Across 11 datasets from healthcare, finance, and manufacturing, AttentionForest achieves the best average minority recall while maintaining competitive precision, calibration, and distributional similarity. PCAForest and EmbedForest reach similar utility with much faster generation, offering favorable accuracy-efficiency trade-offs. Privacy, evaluated with the nearest-neighbor distance ratio and the distance to closest record, is comparable to or better than that of the ForestDiffusion baseline. Ablation studies show that smaller embeddings tend to improve minority recall, while aggressive learning rates harm stability. Overall, latent-space, tree-driven diffusion provides an efficient and privacy-aware approach to high-fidelity tabular data augmentation under severe class imbalance.
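To make the pipeline described in the abstract concrete, the following is a minimal sketch of a PCAForest-style generator: embed the minority rows with PCA, fit gradient-boosted trees to the conditional flow matching target in latent space, integrate the learned vector field from noise, and decode. The abstract gives no implementation details, so everything here is an assumption for illustration: the function names `fit_latent_flow` and `sample_minority` are hypothetical, scikit-learn's `GradientBoostingRegressor` stands in for whichever GBT library the authors use, the latent codes are whitened, a single regressor per latent dimension takes the time `t` as an input feature, and sampling uses plain Euler steps.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor


def fit_latent_flow(X_min, n_components=2, n_noise=10, seed=0):
    """Fit a PCA embedding and one GBT per latent dimension (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Assumption: whitened PCA, so latent codes have roughly unit variance
    # and are on the same scale as the Gaussian noise source.
    pca = PCA(n_components=n_components, whiten=True).fit(X_min)
    # Pair every latent data point with several noise draws.
    z1 = np.repeat(pca.transform(X_min), n_noise, axis=0)
    z0 = rng.standard_normal(z1.shape)
    t = rng.uniform(size=(len(z1), 1))
    # Conditional flow matching with a straight-line path:
    # z_t = (1 - t) * z0 + t * z1, with regression target z1 - z0.
    zt = (1.0 - t) * z0 + t * z1
    target = z1 - z0
    feats = np.hstack([zt, t])
    models = [
        GradientBoostingRegressor(
            n_estimators=50, max_depth=3, learning_rate=0.1, random_state=seed
        ).fit(feats, target[:, j])
        for j in range(n_components)
    ]
    return pca, models


def sample_minority(pca, models, n_samples, n_steps=20, seed=1):
    """Integrate the learned vector field from noise, then decode with PCA."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_samples, len(models)))
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = np.full((n_samples, 1), k * dt)
        feats = np.hstack([z, t])
        # Predicted velocity, one column per latent dimension.
        v = np.column_stack([m.predict(feats) for m in models])
        z = z + dt * v  # explicit Euler step
    # Decoder back to the original feature space.
    return pca.inverse_transform(z)
```

Under these assumptions, calling `fit_latent_flow` on the minority-class rows of a numeric feature matrix and then `sample_minority(pca, models, n_samples=...)` returns synthetic rows with the original number of columns. Categorical features, the nonlinear and attention-based embeddings, and the hyperparameters studied in the ablations are outside the scope of this sketch.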