Crowdsourcing has emerged as an alternative solution for collecting labels at large scale. However, most recruited workers are not domain experts, so the labels they contribute can be noisy. In this paper, we propose a two-stage model to predict the true labels for multicategory classification tasks in crowdsourcing. In the first stage, we fit the observed labels with a latent factor model and incorporate subgroup structures for both tasks and workers through a multi-centroid grouping penalty. Group-specific rotations are introduced to align workers with the different task categories, so that multicategory crowdsourcing tasks can be handled. In the second stage, we propose a concordance-based approach to identify high-quality worker subgroups, whose labels are then relied upon to assign labels to tasks. Theoretically, we establish the estimation consistency of the latent factors and the prediction consistency of the proposed method. Simulation studies show that the proposed method outperforms existing competing methods when subgroup structures are present among tasks and workers. We also demonstrate the application of the proposed method to real-world problems and show its superiority.
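To make the idea of relying on concordant workers concrete, the following is a minimal sketch and not the proposed method: it replaces the latent factor model and the subgroup structure with a plain majority vote as the reference labeling, scores each individual worker by agreement with that reference, and then re-votes using only workers above a cutoff. The function names, the `threshold` value, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(labels_by_task):
    """Most common label per task; used here as a crude reference labeling."""
    return {
        task: Counter(votes.values()).most_common(1)[0][0]
        for task, votes in labels_by_task.items()
    }


def worker_concordance(labels_by_task, reference):
    """Fraction of each worker's labels that agree with the reference labels."""
    agree, total = Counter(), Counter()
    for task, votes in labels_by_task.items():
        for worker, label in votes.items():
            total[worker] += 1
            agree[worker] += int(label == reference[task])
    return {worker: agree[worker] / total[worker] for worker in total}


def predict_with_trusted_workers(labels_by_task, threshold=0.7):
    """Re-vote using only workers whose concordance reaches the threshold.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    reference = majority_vote(labels_by_task)
    concordance = worker_concordance(labels_by_task, reference)
    trusted = {w for w, c in concordance.items() if c >= threshold}
    predictions = {}
    for task, votes in labels_by_task.items():
        kept = {w: lab for w, lab in votes.items() if w in trusted}
        # Fall back to all votes if no trusted worker labelled this task.
        pool = kept or votes
        predictions[task] = Counter(pool.values()).most_common(1)[0][0]
    return predictions, trusted


# Toy data (hypothetical): three categories, five workers, four tasks.
# Workers "a", "b", "c" are reliable; "d" and "e" are noisy.
toy_labels = {
    "t1": {"a": 0, "b": 0, "c": 0, "d": 1, "e": 2},
    "t2": {"a": 1, "b": 1, "c": 1, "d": 0, "e": 1},
    "t3": {"a": 2, "b": 2, "c": 2, "d": 1, "e": 0},
    "t4": {"a": 1, "b": 1, "c": 1, "d": 2, "e": 0},
}
toy_predictions, toy_trusted = predict_with_trusted_workers(toy_labels)
```

On this toy input the noisy workers fall below the cutoff and only the three reliable workers contribute to the final vote. A majority-vote reference breaks down when most workers are unreliable, which is one motivation for a model-based first stage.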