Active Learning (AL) aims to reduce the labeling burden by interactively selecting the most informative samples from a pool of unlabeled data. While there has been extensive research on improving AL query methods in recent years, some studies have questioned the effectiveness of AL compared to emerging paradigms such as semi-supervised (Semi-SL) and self-supervised learning (Self-SL), or a simple optimization of classifier configurations. Thus, today's AL literature presents an inconsistent and contradictory landscape, leaving practitioners uncertain about whether and how to use AL in their tasks. In this work, we make the case that this inconsistency arises from a lack of systematic and realistic evaluation of AL methods. Specifically, we identify five key pitfalls in the current literature that reflect the delicate considerations required for AL evaluation. Further, we present an evaluation framework that overcomes these pitfalls and thus enables meaningful statements about the performance of AL methods. To demonstrate the relevance of our protocol, we present a large-scale empirical study and benchmark for image classification spanning various data sets, query methods, AL settings, and training paradigms. Our findings clarify the inconsistent picture in the literature and enable us to give hands-on recommendations for practitioners. The benchmark is hosted at https://github.com/IML-DKFZ/realistic-al.
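For readers unfamiliar with the pool-based setting described in the abstract, the following is a minimal sketch of one common query strategy, entropy-based uncertainty sampling. It is an illustration only and not the benchmark's implementation: the function names (`entropy_query`, `active_learning_loop`) and the `fit_predict` and `oracle` callables are hypothetical placeholders, and the choice of entropy as the informativeness score is an assumption made for the example.

```python
import math
from typing import Callable, Dict, List, Sequence


def predictive_entropy(p: Sequence[float]) -> float:
    """Shannon entropy of one predicted class distribution (natural log)."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def entropy_query(probs: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return positions of the `batch_size` most uncertain pool samples.

    `probs[i]` is the predicted class distribution for the i-th pool sample.
    Positions are ordered from highest to lowest entropy.
    """
    ranked = sorted(
        range(len(probs)),
        key=lambda i: predictive_entropy(probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


def active_learning_loop(
    fit_predict: Callable[[Dict[int, int], List[int]], List[List[float]]],
    n_samples: int,
    oracle: Callable[[int], int],
    initial: Sequence[int],
    rounds: int,
    batch_size: int,
) -> Dict[int, int]:
    """Hypothetical pool-based AL loop.

    fit_predict(labeled, pool) is assumed to train a model on the labeled
    samples and return class probabilities for every index in `pool`.
    oracle(i) is assumed to return the true label of sample i (the annotator).
    """
    labeled = {i: oracle(i) for i in initial}
    for _ in range(rounds):
        pool = [i for i in range(n_samples) if i not in labeled]
        if not pool:
            break  # nothing left to label
        probs = fit_predict(labeled, pool)
        for j in entropy_query(probs, batch_size):
            labeled[pool[j]] = oracle(pool[j])  # query the annotator
    return labeled
```

Each round retrains on the current labeled set, scores the remaining pool, and sends the highest-entropy samples to the annotator. Other query methods would replace `entropy_query` while leaving the loop unchanged.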