As datasets grow larger, annotating them accurately becomes increasingly impractical because of the cost in both time and money. Crowd-sourcing has therefore been widely adopted to reduce the cost of collecting labels, but it inevitably introduces label noise, which eventually degrades model performance. To learn from crowd-sourced annotations, modeling the expertise of each annotator is a common but challenging paradigm, because such annotations are usually highly sparse. To alleviate this problem, we propose Coupled Confusion Correction (CCC), in which two models are trained simultaneously to correct the confusion matrices learned by each other. Via bi-level optimization, the confusion matrices learned by one model are corrected using the data distilled by the other. Moreover, we cluster annotators who share similar expertise into ``annotator groups'' so that their confusion matrices can be corrected together. In this way, the expertise of the annotators, especially of those who provide few labels, can be better captured. Notably, we point out that annotation sparsity means not only that the average number of labels is low, but also that there are always some annotators who provide very few labels, a fact neglected by previous works when constructing synthetic crowd-sourced annotations. Based on this observation, we propose to use a Beta distribution to control the generation of crowd-sourced labels so that the synthetic annotations are more consistent with real-world ones. Extensive experiments on two types of synthetic datasets and three real-world datasets demonstrate that CCC significantly outperforms state-of-the-art approaches.
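To make the Beta-controlled generation concrete, the following is a minimal sketch of one way such synthetic annotations could be produced. It is not the procedure used in the paper: the function name `synth_annotations`, the Beta parameters, the shared symmetric confusion matrix, and the use of `-1` for a missing label are all assumptions made for illustration.

```python
import numpy as np

def synth_annotations(true_labels, n_annotators, n_classes,
                      alpha=0.5, beta=5.0, accuracy=0.7, seed=0):
    """Hypothetical sketch: sparse, noisy crowd labels with Beta-distributed
    per-annotator labeling rates. All parameter values are illustrative."""
    rng = np.random.default_rng(seed)
    true_labels = np.asarray(true_labels)
    n = len(true_labels)

    # Each annotator draws a labeling rate from Beta(alpha, beta).
    # With alpha < 1 < beta the rates are skewed toward zero, so many
    # annotators contribute very few labels.
    rates = rng.beta(alpha, beta, size=n_annotators)

    # Assumption: every annotator shares one symmetric confusion matrix
    # (correct with probability `accuracy`, otherwise uniform over the rest).
    off_diag = (1.0 - accuracy) / (n_classes - 1)
    confusion = np.full((n_classes, n_classes), off_diag)
    np.fill_diagonal(confusion, accuracy)

    # mask[i, r] is True when annotator r labels instance i.
    mask = rng.random((n, n_annotators)) < rates

    # -1 marks "no label from this annotator".
    annotations = np.full((n, n_annotators), -1, dtype=int)
    for i, r in zip(*np.nonzero(mask)):
        annotations[i, r] = rng.choice(n_classes, p=confusion[true_labels[i]])
    return annotations, rates
```

Calling `synth_annotations(labels, n_annotators=20, n_classes=10)` returns an annotation matrix in which the fraction of instances labeled by each annotator tracks that annotator's sampled rate. Replacing the shared confusion matrix with one matrix per annotator would bring the sketch closer to the setting of annotator-specific expertise described above.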