Monitoring data transfer performance is a crucial task in scientific computing networks. By predicting performance early in the communication phase, potentially slow transfers can be identified and selectively monitored, which optimizes network usage and overall performance. A key bottleneck in improving the predictive power of machine learning (ML) models in this context is class imbalance. This project addresses the class imbalance problem in order to improve the accuracy of performance predictions. We analyze and compare various augmentation strategies, including traditional oversampling methods and generative techniques. We also vary the class imbalance ratio in the training datasets to evaluate its impact on model performance. Although augmentation may improve performance, performance does not improve significantly as the imbalance ratio increases. We conclude that even the most advanced techniques, such as CTGAN, do not significantly improve over simple stratified sampling.
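To make the two simplest ingredients concrete, the following is a minimal sketch of stratified sampling and random oversampling on a toy label vector. It is not the pipeline used in the study: the function names `stratified_split` and `random_oversample`, the slow/normal label encoding, the 10:90 class split, and the target ratio of 0.5 are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so each class keeps roughly its original share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def random_oversample(labels, train_idx, minority_label, target_ratio=1.0, seed=0):
    """Duplicate minority training indices until minority/majority >= target_ratio."""
    rng = random.Random(seed)
    minority = [i for i in train_idx if labels[i] == minority_label]
    majority = [i for i in train_idx if labels[i] != minority_label]
    if not minority:
        raise ValueError("no minority samples to oversample")
    n_needed = max(0, math.ceil(target_ratio * len(majority)) - len(minority))
    extra = [rng.choice(minority) for _ in range(n_needed)]
    return train_idx + extra


# Toy labels (assumed encoding): 1 = slow transfer (minority), 0 = normal (majority).
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
balanced_idx = random_oversample(labels, train_idx, minority_label=1, target_ratio=0.5)
```

Oversampling is applied to the training indices only, so the held-out split keeps the original imbalance and evaluation reflects the real class distribution. Varying `target_ratio` is one simple way to produce training sets with different imbalance ratios.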