Many real-world data stream applications suffer not only from concept drift but also from class imbalance. Yet very few existing studies have investigated this joint challenge. Data difficulty factors, which have been shown to be key challenges in class-imbalanced data streams, are not taken into account by existing approaches for learning from such streams. In this work, we propose a drift-adaptable oversampling strategy that synthesises minority class examples based on stream clustering. The motivation is that stream clustering methods continuously update themselves to reflect the characteristics of the current underlying concept, including data difficulty factors. This property can potentially be used to compress past information without explicitly caching data in memory. Based on the compressed information, synthetic examples can be created within the regions that have recently generated new minority class examples. Experiments with artificial and real-world data streams show that the proposed approach handles concept drift involving different minority class decompositions better than existing approaches, especially when the data stream is severely class imbalanced and presents high proportions of safe and borderline minority class examples.
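The idea of compressing past minority class data into cluster summaries and sampling synthetic examples from recently active regions can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the clustering method or the sampling rule. The class names `MicroCluster` and `MinoritySummary`, the parameters `radius`, `decay` and `min_weight`, and the choice of per-feature Gaussian sampling around cluster centres are all assumptions made for illustration only.

```python
import math
import random


class MicroCluster:
    """Decayed cluster-feature summary: weight, linear sum, squared sum.

    Hypothetical helper for illustration; no raw examples are stored.
    """

    def __init__(self, x, decay):
        self.decay = decay
        self.weight = 1.0
        self.ls = list(x)                 # per-feature linear sum
        self.ss = [v * v for v in x]      # per-feature squared sum

    def fade(self):
        # Exponentially down-weight everything seen so far.
        self.weight *= self.decay
        self.ls = [v * self.decay for v in self.ls]
        self.ss = [v * self.decay for v in self.ss]

    def absorb(self, x):
        self.weight += 1.0
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss = [a + b * b for a, b in zip(self.ss, x)]

    def centre(self):
        return [v / self.weight for v in self.ls]

    def std(self):
        # Clamp at zero to guard against floating-point round-off.
        return [math.sqrt(max(s / self.weight - c * c, 0.0))
                for s, c in zip(self.ss, self.centre())]


class MinoritySummary:
    """Keeps micro-clusters of minority class examples only (assumed design)."""

    def __init__(self, radius=1.0, decay=0.99, min_weight=0.05):
        self.radius = radius
        self.decay = decay
        self.min_weight = min_weight
        self.clusters = []

    def update(self, x):
        # Fade all clusters, then drop those that received no recent examples.
        for mc in self.clusters:
            mc.fade()
        self.clusters = [mc for mc in self.clusters
                         if mc.weight >= self.min_weight]
        nearest = min(self.clusters,
                      key=lambda mc: math.dist(mc.centre(), x),
                      default=None)
        if nearest is not None and math.dist(nearest.centre(), x) <= self.radius:
            nearest.absorb(x)
        else:
            self.clusters.append(MicroCluster(x, self.decay))

    def synthesise(self, n, rng):
        # Pick clusters in proportion to their decayed weight, so regions
        # that recently produced minority examples are sampled more often.
        if not self.clusters:
            return []
        weights = [mc.weight for mc in self.clusters]
        chosen = rng.choices(self.clusters, weights=weights, k=n)
        return [[rng.gauss(c, s) for c, s in zip(mc.centre(), mc.std())]
                for mc in chosen]
```

In this sketch, calling `update` for every incoming minority class example and `synthesise` whenever the classifier needs extra minority examples means that, after a drift, clusters in abandoned regions fade below `min_weight` and stop contributing. A faithful implementation would also need to account for the data difficulty factors (safe, borderline and other example types) that the abstract highlights, which this sketch does not model.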