A survey on online active learning

Online active learning is a machine learning paradigm that aims to select the most informative data points to label from a data stream. Minimizing the cost of collecting labeled observations has gained considerable attention in recent years, particularly in real-world applications where data is available only in unlabeled form. Annotating each observation can be time-consuming and costly, which makes it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed over the past decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning selects a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, in which observations are continuously selected and labeled as they arrive in a stream. This work provides a comprehensive and up-to-date overview of the most recently proposed approaches for selecting the most informative observations from data streams. We review the techniques that have been proposed, discuss their strengths and limitations as well as the open challenges and opportunities in this area of research, and highlight directions for future work.
