Supervised classification algorithms are used to solve a growing number of real-life problems around the globe. Their performance is closely tied to the quality of the labels used in training. Unfortunately, acquiring good-quality annotations is infeasible or too expensive in practice for many tasks. To tackle this challenge, active learning algorithms are commonly employed to select only the most relevant data for labeling. However, this is possible only when the quality and quantity of labels acquired from experts are sufficient. In many applications, the annotation budget forces a trade-off between having multiple annotators label the same samples to increase label quality and annotating new samples to increase the total number of labeled instances. In this paper, we address the issue of faulty data annotations in the context of active learning. In particular, we propose two novel annotation unification algorithms that utilize unlabeled parts of the sample space. The proposed methods require little to no intersection between samples annotated by different experts. Our experiments on four public datasets indicate the robustness and superiority of the proposed methods, in both the estimation of annotator reliability and the assignment of actual labels, compared with state-of-the-art algorithms and simple majority voting.
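For orientation, the simple majority voting baseline mentioned above can be sketched as follows. This is a minimal illustration of the baseline only, not of the two proposed unification algorithms, which the abstract does not specify. The function names `majority_vote` and `annotator_agreement`, the `(sample_id, annotator_id, label)` tuple layout, the tie-breaking rule, and the toy data are all assumptions made for this example. Reliability is approximated here as the fraction of an annotator's labels that agree with the voted label.

```python
from collections import Counter, defaultdict


def majority_vote(annotations):
    """Return {sample_id: label} from (sample_id, annotator_id, label) tuples.

    Assumption: ties are broken by picking the smallest label, so the
    result is deterministic. Labels must be mutually comparable.
    """
    votes = defaultdict(Counter)
    for sample_id, _annotator_id, label in annotations:
        votes[sample_id][label] += 1
    return {
        sample_id: min(counts, key=lambda lab: (-counts[lab], lab))
        for sample_id, counts in votes.items()
    }


def annotator_agreement(annotations, consensus):
    """Return {annotator_id: share of their labels matching the consensus}."""
    hits, totals = Counter(), Counter()
    for sample_id, annotator_id, label in annotations:
        totals[annotator_id] += 1
        hits[annotator_id] += int(label == consensus[sample_id])
    return {a: hits[a] / totals[a] for a in totals}


# Hypothetical toy data: three annotators, three samples, partial overlap.
annotations = [
    ("s1", "a", "cat"), ("s1", "b", "cat"), ("s1", "c", "dog"),
    ("s2", "a", "dog"), ("s2", "b", "dog"), ("s2", "c", "dog"),
    ("s3", "a", "cat"), ("s3", "c", "dog"),
]

consensus = majority_vote(annotations)
reliability = annotator_agreement(annotations, consensus)
print(consensus)
print(reliability)
```

Note that this baseline is only informative when the same sample is labeled by several annotators: with a single annotation per sample, the vote returns that label unchanged and every annotator scores perfect agreement. That dependence on overlapping annotations is the limitation that methods requiring little to no intersection aim to avoid.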