Both the quality and the quantity of annotations positively affect learning performance in sequence labeling, a vital task in Natural Language Processing. Hiring domain experts to annotate a corpus is costly in both money and time. Crowdsourcing platforms, such as Amazon Mechanical Turk (AMT), have been used to reduce this cost. However, annotations collected this way are prone to human error because crowd workers lack domain expertise. Existing work on annotation aggregation assumes that annotations are independent, and therefore struggles with sequential label aggregation tasks, where labels have complex dependencies. To address these challenges, we propose an optimization-based method that infers the ground-truth labels for sequential labeling tasks from the annotations provided by workers. The proposed Aggregation method for Sequential Labels from Crowds ($AggSLC$) jointly considers the characteristics of sequential labeling tasks, workers' reliabilities, and advanced machine learning techniques. A theoretical convergence analysis further shows that $AggSLC$ halts after a finite number of iterations. We evaluate $AggSLC$ on several crowdsourced datasets for Named Entity Recognition (NER) and biomedical Information Extraction (PICO) tasks, as well as on a simulated dataset. Our results show that the proposed method outperforms state-of-the-art aggregation methods. To gain insight into the framework, we study the effectiveness of $AggSLC$'s components through ablation studies.