As datasets grow larger, annotating them accurately becomes increasingly impractical because of the cost in both time and money. Crowd-sourcing has therefore been widely adopted to reduce the cost of collecting labels, but it inevitably introduces label noise, which eventually degrades model performance. To learn from crowd-sourced annotations, modeling the expertise of each annotator is a common but challenging paradigm, because the annotations collected by crowd-sourcing are usually highly sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), in which two models are trained simultaneously, each correcting the confusion matrices learned by the other. Via bi-level optimization, the confusion matrices learned by one model are corrected using data distilled from the other. Moreover, we cluster annotators who share similar expertise into ``annotator groups'' so that their confusion matrices can be corrected together. In this way, the expertise of the annotators, especially those who provide few labels, can be better captured. Notably, we point out that annotation sparsity means not only that the average number of labels is low, but also that there are always some annotators who provide very few labels, a fact neglected by previous works when constructing synthetic crowd-sourcing annotations. Based on this observation, we propose to use a Beta distribution to control the generation of crowd-sourcing labels so that the synthetic annotations are more consistent with real-world ones. Extensive experiments on two types of synthetic datasets and three real-world datasets demonstrate that CCC significantly outperforms state-of-the-art approaches. Source code is available at: https://github.com/Hansong-Zhang/CCC.
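To make the Beta-controlled generation of synthetic annotations concrete, the following is a minimal sketch and not the authors' released implementation. It assumes that each annotator's probability of labeling an item is drawn from a Beta distribution, so that a skewed choice of parameters yields a few prolific annotators and many who provide very few labels. The function names, the default parameters `a=0.5` and `b=2.0`, the symmetric-noise confusion matrices, and the rule that every item receives at least one label are all assumptions made for illustration.

```python
import numpy as np


def make_confusion_matrices(num_annotators, num_classes, rng, low=0.3, high=0.9):
    """Build one row-stochastic confusion matrix per annotator.

    Assumption: symmetric noise, i.e. each annotator has a single accuracy
    on the diagonal and spreads the remaining mass uniformly off-diagonal.
    """
    acc = rng.uniform(low, high, size=num_annotators)
    off = (1.0 - acc) / (num_classes - 1)
    mats = off[:, None, None] * np.ones((num_classes, num_classes))
    idx = np.arange(num_classes)
    mats[:, idx, idx] = acc[:, None]
    return mats


def synth_annotations(true_labels, confusions, rng, a=0.5, b=2.0):
    """Generate a sparse (items x annotators) label matrix; -1 means no label.

    Assumption: the per-annotator labeling rate is drawn from Beta(a, b).
    """
    true_labels = np.asarray(true_labels)
    n = len(true_labels)
    num_annotators, num_classes, _ = confusions.shape

    # Per-annotator probability of labeling any given item.
    rates = rng.beta(a, b, size=num_annotators)
    mask = rng.random((n, num_annotators)) < rates

    # Assumption: guarantee at least one label per item.
    empty = ~mask.any(axis=1)
    mask[empty, rng.integers(0, num_annotators, size=empty.sum())] = True

    # Sample each noisy label from the annotator's confusion-matrix row
    # for the item's true class, using inverse-CDF sampling.
    items, annots = np.nonzero(mask)
    probs = confusions[annots, true_labels[items]]
    cdf = probs.cumsum(axis=1)
    u = rng.random(len(items))
    sampled = np.minimum((u[:, None] > cdf).sum(axis=1), num_classes - 1)

    labels = np.full((n, num_annotators), -1, dtype=int)
    labels[items, annots] = sampled
    return labels, rates


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    confusions = make_confusion_matrices(num_annotators=30, num_classes=10, rng=rng)
    y = rng.integers(0, 10, size=1000)
    labels, rates = synth_annotations(y, confusions, rng)
    # Number of labels each annotator provided; expected to be highly uneven.
    print(np.sort((labels >= 0).sum(axis=0)))
```

With `a < 1 < b`, most of the Beta mass sits near zero, so the sorted per-annotator label counts should be dominated by small values with a short tail of active annotators. Other parameter choices would give a different sparsity profile.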