Active learning (AL) techniques aim to make the most of a labeling budget by iteratively selecting the instances that are most likely to improve prediction accuracy. However, their benefit over random sampling has not been consistent across setups, e.g., different datasets and classifiers. In this empirical study, we examine how a combination of factors can obscure any gains from an AL technique. Focusing on text classification, we rigorously evaluate AL techniques over roughly 1000 experiments that vary with respect to the dataset, batch size, text representation, and classifier. We show that AL is effective only in a narrow set of circumstances. We also address the problem of choosing metrics that are better aligned with real-world expectations. The impact of this study lies in its insights for a practitioner: (a) the choice of text representation and classifier is as important as that of the AL technique, (b) the choice of the right metric is critical in assessing the latter, and (c) reported AL results must be interpreted holistically, accounting for variables other than just the query strategy.
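To make the setup concrete, the following is a minimal sketch of a pool-based AL loop in which the query strategy is a swappable argument, so that it can be compared against a random-sampling baseline under the same batch size. All names here (`least_confident`, `random_batch`, `run_loop`, and the `fit`, `predict_proba`, and `oracle` callables) are hypothetical and chosen for illustration; least-confidence sampling is used only as one common example of a query strategy and is not necessarily among those evaluated in the study.

```python
import random
from typing import Any, Callable, Dict, List, Sequence


def least_confident(probs: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return positions of the items whose highest class probability is lowest."""
    order = sorted(range(len(probs)), key=lambda i: max(probs[i]))
    return order[:batch_size]


def random_batch(n_pool: int, batch_size: int, rng: random.Random) -> List[int]:
    """Random-sampling baseline: draw positions uniformly without replacement."""
    return rng.sample(range(n_pool), min(batch_size, n_pool))


def run_loop(
    pool: Sequence[Any],
    oracle: Callable[[int], int],           # returns the label of pool[i]
    fit: Callable[[list, list], Any],       # trains a model on (X, y)
    predict_proba: Callable[[Any, list], Sequence[Sequence[float]]],
    select: Callable[[Sequence[Sequence[float]], int], List[int]],
    seed_idx: Sequence[int],
    batch_size: int,
    rounds: int,
) -> Dict[int, int]:
    """Run `rounds` iterations of query-then-label; return {pool index: label}."""
    labeled = {i: oracle(i) for i in seed_idx}
    for _ in range(rounds):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Retrain on everything labeled so far, then score the unlabeled pool.
        model = fit([pool[i] for i in labeled], [labeled[i] for i in labeled])
        probs = predict_proba(model, [pool[i] for i in unlabeled])
        # `select` returns positions within `unlabeled`, not pool indices.
        for pos in select(probs, batch_size):
            labeled[unlabeled[pos]] = oracle(unlabeled[pos])
    return labeled
```

Passing `least_confident` as `select` gives an AL run, while passing `lambda probs, k: random_batch(len(probs), k, rng)` gives the random baseline. Holding the text representation, classifier (`fit` and `predict_proba`), and batch size fixed while swapping only `select` isolates the effect of the query strategy.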