Classifying modern streaming data is challenging because of concept drift and class imbalance. Both degrade classifier output and lead to misclassification. Other factors, such as overlap between classes, further limit the correctness of the output. This work proposes a novel classification framework for nonstationary, drifting, imbalanced data streams that integrates data pre-processing with dynamic ensemble selection. The framework was evaluated on six artificially generated data streams with differing imbalance ratios, in combination with two types of concept drift. Each stream consists of 200 chunks of 500 objects, each described by eight features, and contains five concept drifts. Seven pre-processing techniques and two dynamic ensemble selection methods were considered. According to the experimental results, combining data pre-processing with dynamic ensemble selection delivers significantly higher accuracy on imbalanced data streams.
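The chunk-based setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: every function and parameter name is hypothetical, random oversampling stands in for the seven pre-processing techniques (which the abstract does not name), a neighbourhood-weighted vote stands in for the two dynamic ensemble selection methods (also unnamed), the stream generator uses Gaussian class centroids with sudden drifts only, and balanced accuracy is an assumed metric. Only the default stream dimensions (200 chunks, 500 objects, eight features, five drifts) come from the text.

```python
import numpy as np


def make_stream(rng, n_chunks, chunk_size, n_features, n_drifts, minority_ratio):
    """Yield (X, y) chunks; class centroids are redrawn at each sudden drift."""
    drift_every = n_chunks // (n_drifts + 1)
    centers = rng.normal(size=(2, n_features))
    for i in range(n_chunks):
        if i > 0 and i % drift_every == 0 and i // drift_every <= n_drifts:
            centers = rng.normal(size=(2, n_features))  # assumed sudden drift
        y = (rng.random(chunk_size) < minority_ratio).astype(int)
        y[:2] = [0, 1]  # guarantee both classes appear in every chunk
        X = centers[y] + rng.normal(size=(chunk_size, n_features))
        yield X, y


def random_oversample(rng, X, y):
    """Duplicate random minority samples until both classes have equal size."""
    counts = np.bincount(y, minlength=2)
    idx = np.flatnonzero(y == int(np.argmin(counts)))
    extra = rng.choice(idx, size=counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


class NearestCentroid:
    """Tiny base classifier: predict the class with the closest mean."""

    def fit(self, X, y):
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in (0, 1)])
        return self

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None], axis=2)
        return d.argmin(axis=1)


def des_predict(pool, X_val, y_val, X, k=7):
    """Weight each pool member by how many of the k nearest validation
    neighbours of a test sample it labels correctly, then take a weighted vote."""
    correct = np.stack([clf.predict(X_val) == y_val for clf in pool])
    dist = np.linalg.norm(X[:, None, :] - X_val[None], axis=2)
    nn = np.argsort(dist, axis=1)[:, :k]           # (n_test, k)
    weights = correct[:, nn].sum(axis=2)           # (n_models, n_test)
    votes = np.stack([clf.predict(X) for clf in pool])
    score1 = (weights * (votes == 1)).sum(axis=0)
    score0 = (weights * (votes == 0)).sum(axis=0)
    return (score1 > score0).astype(int)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in (0, 1)]))


def run(n_chunks=200, chunk_size=500, n_features=8, n_drifts=5,
        minority_ratio=0.1, pool_size=5, seed=0):
    """Test-then-train loop: score each chunk first, then learn from it."""
    rng = np.random.default_rng(seed)
    pool, scores, val = [], [], None
    stream = make_stream(rng, n_chunks, chunk_size, n_features, n_drifts,
                         minority_ratio)
    for X, y in stream:
        if pool:
            scores.append(balanced_accuracy(y, des_predict(pool, *val, X)))
        X_bal, y_bal = random_oversample(rng, X, y)
        pool = (pool + [NearestCentroid().fit(X_bal, y_bal)])[-pool_size:]
        val = (X_bal, y_bal)  # the latest balanced chunk serves as validation set
    return scores
```

Calling `run()` with its defaults processes one stream of the size given in the abstract and returns one balanced-accuracy score per chunk after the first. Keeping only the most recent `pool_size` classifiers is a simple way to forget outdated concepts after a drift; it is one assumed design choice among many.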