Learning from Data Streams: An Overview and Update

The literature on machine learning in the context of data streams is vast and growing. However, many of the defining assumptions regarding data-stream learning tasks are too strong to hold in practice, or are even contradictory, such that they cannot be met in the context of supervised learning. Algorithms are chosen and designed based on criteria that are often not clearly stated, for problem settings that are not clearly defined, and are tested in unrealistic settings and/or in isolation from related approaches in the wider literature. This calls into question the potential for real-world impact of many approaches conceived in such contexts, and risks propagating a misguided research focus. We propose to tackle these issues by reformulating the fundamental definitions and settings of supervised data-stream learning with regard to contemporary considerations of concept drift and temporal dependence; we take a fresh look at what constitutes a supervised data-stream learning task, and reconsider the algorithms that may be applied to such tasks. Drawing on this formulation and overview, and helped by an informal survey of industrial players dealing with real-world data streams, we provide recommendations. Our main emphasis is that learning from data streams does not impose a single-pass or online-learning approach, or any particular learning regime, and that any constraints on memory and time are not specific to streaming. Meanwhile, established techniques for dealing with temporal dependence and concept drift exist in other areas of the literature. For the data streams community, we thus encourage a shift in research focus, from dealing with often-artificial constraints and assumptions on the learning mode, to issues such as robustness, privacy, and interpretability, which are increasingly relevant to learning in data streams in academic and industrial settings.
