Class imbalance arises when some classes in a dataset have significantly fewer samples than others, which biases model performance toward the majority classes. This paper addresses class imbalance in network intrusion detection by using Tabular Denoising Diffusion Probabilistic Models (TabDDPM) for data augmentation. Our approach synthesizes high-fidelity samples for the minority classes of the CIC-IDS2017 dataset through an iterative denoising process. For each minority class with few samples, synthetic samples were generated and merged with the original dataset. The augmented training data enables an ANN classifier to achieve near-perfect recall on previously underrepresented attack classes. These results establish diffusion models as an effective solution for tabular data imbalance in security domains, with potential applications in fraud detection and medical diagnostics.
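The merge step and the per-class recall metric described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names (`augment_minority`, `per_class_recall`, `jitter_sampler`), the `target_count` parameter, and the toy rows are hypothetical and do not come from the paper. In particular, `jitter_sampler` is a trivial placeholder that resamples real rows with Gaussian noise; it is not a diffusion model and only stands in for a trained TabDDPM sampler.

```python
import random
from collections import Counter

_rng = random.Random(0)  # fixed seed so the toy example is repeatable


def jitter_sampler(real_rows, n):
    """Placeholder sampler (NOT a diffusion model): copies random real rows
    and adds small Gaussian noise. A trained TabDDPM sampler would go here."""
    return [[v + _rng.gauss(0.0, 0.01) for v in _rng.choice(real_rows)]
            for _ in range(n)]


def augment_minority(rows, labels, synthesize, target_count):
    """Top up every class that has fewer than target_count samples with
    synthetic rows, then return the merged rows and labels."""
    counts = Counter(labels)
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        deficit = target_count - n
        if deficit > 0:
            real = [r for r, l in zip(rows, labels) if l == cls]
            new_rows = synthesize(real, deficit)
            out_rows.extend(new_rows)
            out_labels.extend([cls] * len(new_rows))
    return out_rows, out_labels


def per_class_recall(y_true, y_pred):
    """Recall per class: correctly predicted samples / true samples."""
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    totals = Counter(y_true)
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Toy training split: 6 majority rows and 2 minority rows (made-up values).
train_rows = [[0.0, 1.0]] * 6 + [[5.0, 5.0]] * 2
train_labels = ["BENIGN"] * 6 + ["Infiltration"] * 2

aug_rows, aug_labels = augment_minority(
    train_rows, train_labels, jitter_sampler, target_count=6
)
balanced_counts = Counter(aug_labels)
```

In this sketch only the training split is augmented; a held-out test split of real samples would be left untouched so that the reported per-class recall reflects performance on real traffic.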