Crowdsourcing systems have been used to accumulate massive amounts of labeled data for applications such as computer vision and natural language processing. However, because crowdsourced labeling is inherently dynamic and uncertain, developing a technique that works in most situations is extremely challenging. In this paper, we introduce Crowd-Certain, a novel approach for label aggregation in crowdsourced and ensemble learning classification tasks that offers improved performance and computational efficiency across different numbers of annotators and a variety of datasets. The proposed method uses the consistency of the annotators versus a trained classifier to determine a reliability score for each annotator. Furthermore, Crowd-Certain leverages predicted probabilities, enabling the reuse of trained classifiers on future sample data and thereby eliminating the recurrent simulation processes inherent in existing methods. We extensively evaluated our approach against ten existing techniques across ten different datasets, each labeled by varying numbers of annotators. The findings demonstrate that Crowd-Certain outperforms the existing methods (Tao, Sheng, KOS, MACE, MajorityVote, MMSR, Wawa, Zero-Based Skill, GLAD, and Dawid Skene) in nearly all scenarios, delivering higher average accuracy, F1 score, and AUC. Additionally, we introduce a variation of two existing confidence score measurement techniques. Finally, we evaluate these two confidence score techniques using two evaluation metrics: Expected Calibration Error (ECE) and Brier Score Loss. Our results show that Crowd-Certain achieves higher Brier Score and lower ECE across the majority of the examined datasets, suggesting better-calibrated results.
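For readers unfamiliar with the two calibration metrics named above, the following is a minimal sketch of how they are commonly computed for binary labels. It uses the standard textbook definitions, not necessarily the exact variants used in this work. The function names, the 0.5 decision threshold, and the choice of ten equal-width bins are assumptions made for illustration.

```python
import numpy as np


def brier_score_loss(y_true, y_prob):
    """Mean squared difference between predicted probability and the 0/1 label.

    0 is a perfect score for this loss.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Binary ECE with equal-width confidence bins (n_bins is an assumed default).

    Confidence is the probability assigned to the predicted class; ECE is the
    sample-weighted average gap between accuracy and mean confidence per bin.
    """
    y_true = np.asarray(y_true, dtype=int)
    y_prob = np.asarray(y_prob, dtype=float)

    # Predicted class (assumed 0.5 threshold) and the confidence in it.
    y_pred = (y_prob >= 0.5).astype(int)
    confidence = np.where(y_pred == 1, y_prob, 1.0 - y_prob)
    correct = (y_pred == y_true).astype(float)

    # Assign each sample to one of n_bins equal-width bins over [0, 1].
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_index = np.digitize(confidence, interior_edges)

    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_index == b
        if not np.any(in_bin):
            continue  # empty bins contribute nothing
        gap = abs(correct[in_bin].mean() - confidence[in_bin].mean())
        ece += in_bin.mean() * gap
    return float(ece)


if __name__ == "__main__":
    # Toy labels and probabilities, invented purely for illustration.
    labels = [1, 1, 0, 0]
    probs = [0.8, 0.8, 0.8, 0.8]
    print("Brier score loss:", brier_score_loss(labels, probs))
    print("ECE:", expected_calibration_error(labels, probs))
```

In this toy case every sample is predicted positive with confidence 0.8 while only half are correct, so the single occupied bin shows an overconfidence gap that the ECE reports directly.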