This paper introduces a worker selection algorithm that improves annotation quality and reduces cost in span-based sequence labeling tasks in Natural Language Processing (NLP). Unlike previous studies, which target simpler tasks, this work has to account for the label interdependencies inherent in sequence labeling. The proposed algorithm selects workers with a Combinatorial Multi-Armed Bandit (CMAB) approach. Imbalanced and small-scale datasets hinder offline simulation of worker selection; we address this with a data augmentation method designed specifically for sequence labeling, termed shifting, expanding, and shrinking (SES). In experiments on the CoNLL 2003 NER and Chinese OEI datasets, the algorithm reaches an F1 score of up to 100.04% of the expert-only baseline while saving up to 65.97% of the cost. The paper also includes a dataset-independent test that emulates annotation evaluation with a Bernoulli distribution; in that setting the algorithm still reaches 97.56% of the expert baseline's F1 score with 59.88% cost savings. Together, these results suggest that worker selection can be made practical for complex, span-based NLP annotation tasks.
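To make the worker selection setting concrete, the following is a minimal sketch of a generic CUCB-style combinatorial bandit, not the paper's exact algorithm. It assumes that each worker's annotation quality can be emulated as a Bernoulli reward (in the spirit of the dataset-independent test), that `k` workers are chosen per round, and that the confidence bonus follows the standard CUCB form. The names `select_workers`, `simulate`, and the `quality` values are hypothetical.

```python
import math
import random
from typing import List


def select_workers(counts: List[int], means: List[float], t: int, k: int) -> List[int]:
    """Return the k workers with the highest upper-confidence index.

    Workers that have never been tried get an infinite index, so each
    worker is explored at least once.
    """
    def index(i: int) -> float:
        if counts[i] == 0:
            return float("inf")
        # Standard CUCB bonus (assumed here): sqrt(3 ln t / (2 * n_i)).
        return means[i] + math.sqrt(1.5 * math.log(t) / counts[i])

    return sorted(range(len(counts)), key=index, reverse=True)[:k]


def simulate(quality: List[float], k: int, rounds: int, seed: int = 0) -> List[int]:
    """Simulate worker selection with Bernoulli feedback.

    quality[i] is the assumed probability that worker i produces a
    correct annotation. Returns how often each worker was selected.
    """
    rng = random.Random(seed)
    n = len(quality)
    counts = [0] * n
    means = [0.0] * n
    for t in range(1, rounds + 1):
        for i in select_workers(counts, means, t, k):
            reward = 1.0 if rng.random() < quality[i] else 0.0
            counts[i] += 1
            # Incremental update of the empirical mean reward.
            means[i] += (reward - means[i]) / counts[i]
    return counts


if __name__ == "__main__":
    # Hypothetical worker qualities; the selection should concentrate
    # on the more reliable workers over time.
    print(simulate([0.9, 0.8, 0.3, 0.2, 0.1], k=2, rounds=2000))
```

In this sketch the reward is a single correct/incorrect signal per annotation. For span-based labeling, a span-level score such as F1 against a reference would be a natural replacement, but that choice is an assumption and is not specified here.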
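The shifting, expanding, and shrinking (SES) idea can be illustrated with a small sketch. This is one plausible reading of the three operations on a single annotated span, not the paper's implementation: spans are assumed to be half-open token index pairs `(start, end)`, each operation moves a boundary by a fixed `step`, and variants that fall outside the sentence or become empty are dropped. The function name `ses_variants` and its parameters are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def ses_variants(span: Span, seq_len: int, step: int = 1) -> List[Span]:
    """Generate shifted, expanded, and shrunk variants of one span.

    Variants that leave the sentence, become empty, or equal the
    original span are discarded; duplicates are removed.
    """
    start, end = span
    candidates = [
        (start - step, end - step),  # shift left
        (start + step, end + step),  # shift right
        (start - step, end),         # expand on the left
        (start, end + step),         # expand on the right
        (start + step, end),         # shrink from the left
        (start, end - step),         # shrink from the right
    ]
    seen = set()
    variants: List[Span] = []
    for s, e in candidates:
        valid = 0 <= s < e <= seq_len
        if valid and (s, e) != span and (s, e) not in seen:
            seen.add((s, e))
            variants.append((s, e))
    return variants


if __name__ == "__main__":
    # A two-token span in a six-token sentence yields six variants.
    print(ses_variants((2, 4), seq_len=6))
```

Each variant can stand in for a plausible imperfect annotation of the same span, which is how augmentation of this kind could enlarge a small pool of crowd annotations for offline simulation.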