Supervised neural approaches are hindered by their dependence on large, meticulously annotated datasets, a requirement that is particularly cumbersome for sequential tasks. The quality of annotations tends to deteriorate with the transition from expert-based to crowd-sourced labelling. To address these challenges, we present \textbf{CAMELL} (Confidence-based Acquisition Model for Efficient self-supervised active Learning with Label validation), a pool-based active learning framework tailored for sequential multi-output problems. CAMELL possesses three core features: (1) it requires expert annotators to label only a fraction of a chosen sequence, (2) it facilitates self-supervision for the remainder of the sequence, and (3) it employs a label validation mechanism to prevent erroneous labels from contaminating the dataset and harming model performance. We evaluate CAMELL on sequential tasks, with a special emphasis on dialogue belief tracking, a task plagued by the constraints of limited and noisy datasets. Our experiments demonstrate that CAMELL outperforms the baselines in terms of efficiency. Furthermore, the data corrections suggested by our method contribute to an overall improvement in the quality of the resulting datasets.
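To make the three features above concrete, the following is a minimal sketch of how confidence-gated labelling of a single sequence could look. It is not the CAMELL implementation: the function `label_sequence`, the callback `ask_annotator`, the two thresholds, and the use of the model's per-step probability as a stand-in for a confidence estimate are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical types: one probability distribution over labels per time step.
Distribution = Dict[str, float]
Labelled = Tuple[str, str]  # (label, source), source is "self", "expert" or "rejected"


def label_sequence(
    step_distributions: Sequence[Distribution],
    ask_annotator: Callable[[int], str],
    query_threshold: float = 0.8,    # assumed value, not taken from the paper
    validation_floor: float = 0.05,  # assumed value, not taken from the paper
) -> List[Labelled]:
    """Label one sequence, querying the annotator only at uncertain steps."""
    labelled: List[Labelled] = []
    for step, dist in enumerate(step_distributions):
        best_label = max(dist, key=dist.get)
        if dist[best_label] >= query_threshold:
            # Confident step: keep the model's own prediction (self-supervision).
            labelled.append((best_label, "self"))
            continue
        # Uncertain step: only this part of the sequence goes to the annotator.
        proposed = ask_annotator(step)
        if dist.get(proposed, 0.0) < validation_floor:
            # Label validation: the model finds this label very unlikely,
            # so it is held back instead of being added to the dataset.
            labelled.append((proposed, "rejected"))
        else:
            labelled.append((proposed, "expert"))
    return labelled


if __name__ == "__main__":
    # Toy data: three time steps, with annotator answers for steps 1 and 2.
    distributions = [
        {"hotel": 0.95, "taxi": 0.05},
        {"hotel": 0.55, "taxi": 0.45},
        {"hotel": 0.50, "taxi": 0.49, "train": 0.01},
    ]
    answers = {1: "taxi", 2: "train"}
    for item in label_sequence(distributions, answers.__getitem__):
        print(item)
```

In this sketch the annotator is consulted only for the steps where the model is unsure, and a proposed label that the model considers very unlikely is set aside for re-checking rather than silently accepted. A real system would presumably use a learned, calibrated confidence estimate rather than raw output probabilities.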