Active Learning (AL) has received significant attention in the field of machine learning for its potential in selecting the most informative samples for labeling, thereby reducing data annotation costs. However, we show that the performance lifts reported in recent literature generalize poorly to other domains, leading to an inconclusive landscape in AL research. Furthermore, we highlight overlooked problems in reproducing AL experiments that can lead to unfair comparisons and increased variance in the results. This paper addresses these issues by providing an AL framework for a fair comparison of algorithms across different tasks and domains, as well as a fast and performant oracle algorithm for evaluation. To the best of our knowledge, we propose the first AL benchmark that tests algorithms in 3 major domains: Tabular, Image, and Text. We report empirical results for 6 widely used algorithms on 7 real-world and 2 synthetic datasets and aggregate them into a domain-specific ranking of AL algorithms.