In real-world data labeling applications, annotators often provide imperfect labels. It is thus common to employ multiple annotators to label data with some overlap between their examples. We study active learning in such settings, aiming to train an accurate classifier by collecting a dataset with the fewest total annotations. Here we propose ActiveLab, a practical method to decide what to label next that works with any classifier model and can be used in pool-based batch active learning with one or multiple annotators. ActiveLab automatically estimates when it is more informative to re-label examples vs. labeling entirely new ones. This is a key aspect of producing high-quality labels and trained models within a limited annotation budget. In experiments on image and tabular data, ActiveLab reliably trains more accurate classifiers with far fewer annotations than a wide variety of popular active learning methods.
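To make the re-label vs. label-new trade-off concrete, the sketch below ranks already-labeled and unlabeled examples in one shared pool and picks the lowest-scoring batch. It is only an illustration of the general idea: the scoring rule (treating the classifier as one extra soft vote alongside the annotators) and the names `consensus_confidence`, `select_batch`, and `model_weight` are assumptions made for this example, not the estimator ActiveLab actually uses.

```python
def consensus_confidence(annotations, model_probs, model_weight=1.0):
    """Hypothetical score: confidence in an example's current best label.

    annotations: class indices given by annotators (empty list if unlabeled).
    model_probs: the classifier's predicted class probabilities.
    The classifier is treated as one extra soft vote (an assumption of
    this sketch, not the rule from the paper).
    """
    votes = [model_weight * p for p in model_probs]
    for label in annotations:
        votes[label] += 1.0
    return max(votes) / sum(votes)


def select_batch(pool, batch_size):
    """Return ids of the examples whose labels we are least sure about.

    pool: dict mapping example id -> (annotations, model_probs).
    Labeled and unlabeled examples compete in the same ranking, so the
    batch can mix re-labeling with labeling new examples.
    """
    scores = {ex_id: consensus_confidence(a, p) for ex_id, (a, p) in pool.items()}
    ranked = sorted(scores, key=lambda ex_id: (scores[ex_id], ex_id))
    return ranked[:batch_size]


# Toy pool with two classes (all values are made up for illustration).
pool = {
    "a": ([0, 0, 0], [0.6, 0.4]),   # three annotators agree
    "b": ([0, 1], [0.5, 0.5]),      # annotators disagree, model unsure
    "c": ([], [0.95, 0.05]),        # unlabeled, model confident
    "d": ([], [0.55, 0.45]),        # unlabeled, model unsure
    "e": ([0], [0.2, 0.8]),         # one label that the model disputes
}

batch = select_batch(pool, 2)
print(batch)  # ['b', 'd']
```

In this toy pool the selected batch contains one example to re-label ("b", where annotators disagree) and one new example ("d", where the model is unsure), which is the kind of mixed decision the abstract describes.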