Active learning (AL) techniques optimally utilize a labeling budget by iteratively selecting the instances most valuable for learning. However, they lack ``prerequisite checks'', i.e., there are no prescribed criteria for picking the AL algorithm best suited to a given dataset. A practitioner must pick a technique they \emph{trust} will beat random sampling, based on previously reported results, and hope that it is resilient to the many variables in their environment: dataset, labeling budget, and prediction pipeline. The important questions then are: how often, on average, do we expect any AL technique to reliably beat the computationally cheap and easy-to-implement strategy of random sampling? Does it at least make sense to use AL in an ``Always ON'' mode in a prediction pipeline, so that while it might not always help, it never under-performs random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution is a rigorous evaluation of AL techniques, old and new, across setups that vary with respect to datasets, text representations, and classifiers. This yields multiple insights around warm-up times, i.e., the number of labels before gains from AL are seen, the viability of an ``Always ON'' mode, and the relative significance of different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.
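The pool-based AL loop contrasted with random sampling above can be illustrated with a minimal, hypothetical sketch. Everything here (synthetic data, scikit-learn logistic regression, margin-based uncertainty sampling, the batch sizes) is an illustrative assumption and not a method or result from this paper:

```python
# Illustrative sketch only: pool-based active learning vs. random sampling.
# All design choices (classifier, query strategy, budgets) are assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_pool, y_pool = X[:800], y[:800]   # unlabeled pool (labels revealed on query)
X_test, y_test = X[800:], y[800:]   # held-out evaluation set

def run(strategy, seed_size=20, batch=20, rounds=5):
    """Run `rounds` of AL with the given query strategy; return test accuracy."""
    labeled = list(range(seed_size))  # initial seed set of labeled indices
    for _ in range(rounds):
        clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
        unlabeled = [i for i in range(len(X_pool)) if i not in set(labeled)]
        if strategy == "uncertainty":
            # Margin-based uncertainty: query points whose predicted
            # probability is closest to the 0.5 decision boundary.
            probs = clf.predict_proba(X_pool[unlabeled])
            margin = np.abs(probs[:, 1] - 0.5)
            picks = [unlabeled[i] for i in np.argsort(margin)[:batch]]
        else:
            # Baseline: random sampling from the unlabeled pool.
            picks = list(rng.choice(unlabeled, size=batch, replace=False))
        labeled += picks  # "label" the queried instances
    clf = LogisticRegression(max_iter=1000).fit(X_pool[labeled], y_pool[labeled])
    return clf.score(X_test, y_test)

acc_al = run("uncertainty")
acc_rand = run("random")
```

Whether `acc_al` actually exceeds `acc_rand` is exactly the question the paper studies: it depends on the dataset, the labeling budget, and the prediction pipeline.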