Learning from non-stationary data streams is attracting increasing interest as more data becomes available in the form of streams, for example from social media, smartphones, or industrial process monitoring. Most approaches assume that the ground truth of each sample becomes available (possibly with some delay) and perform supervised online learning in the test-then-train scheme. While this assumption may hold in some scenarios, it does not apply to all settings. In this work, we focus on scarcely labeled data streams and explore the potential of self-labeling in gradually drifting data streams. We formalize this setup and propose a novel online $k$-NN classifier that combines self-labeling and demand-based active learning.